All corrections
LessWrong March 15, 2026 at 04:58 PM

www.lesswrong.com/posts/ioZxrP7BhS5ArK59w/did-claude-3-opus-align-itself-via-gra...

2 corrections found

1
Claim
Under ReLU (and, very nearly, the more widely used GeLU), neurons with negative pre-activations have their final activations clipped to zero. This means that, for an inactive neuron, the error signal at that neuron is also clipped to zero. This prevents gradient from flowing through it, both to update the weights feeding into it and to propagate further back to earlier layers
Correction

This is true for ReLU, but not for GELU. GELU is defined as x·Φ(x), and since Φ(x) is positive for every finite x, negative inputs are not clipped to zero and their gradients are not zeroed out the way ReLU's are.

Full reasoning

The post treats GeLU as if it behaved like ReLU on negative inputs, but that is not how GeLU works.

ReLU is a hard gating function: \(\mathrm{ReLU}(x)=\max(0,x)\), so negative inputs become exactly 0 and the backward signal is zero on that side.

By contrast, the original GELU paper defines GELU as \(x\Phi(x)\), where \(\Phi(x)\) is the standard normal CDF. PyTorch's documentation gives the same formula. Because \(\Phi(x)\) is positive for all finite \(x\), GELU does not clip negative pre-activations to zero. For example, when \(x<0\), \(x\Phi(x)\) is a small negative number, not 0. So the claim that negative pre-activations are "clipped to zero" under GeLU is mathematically incorrect.

That also undermines the next two sentences: if the activation is not zeroed out in the ReLU way, then the neuron is not generally "inactive" in the same sense, and the gradient is not generally clipped to zero either. GELU is smooth and differentiable, so gradient can still flow for negative inputs.
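A quick numeric check makes the difference concrete. This is a minimal sketch using the exact x·Φ(x) definition from the GELU paper (function names are illustrative, not from the post):

```python
import math

def relu(x):
    """Hard gate: negative inputs are clipped to exactly 0."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    """d/dx [x * Phi(x)] = Phi(x) + x * phi(x), with phi the normal PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

x = -1.0
print(relu(x))       # 0.0: activation and backward signal both dead
print(gelu(x))       # ~ -0.1587: small negative, not clipped to zero
print(gelu_grad(x))  # ~ -0.0833: nonzero, so gradient still flows
```

At x = −1 the ReLU output and gradient are both exactly zero, while GELU's output (≈ −0.159) and derivative (≈ −0.083) are not, which is the whole point of the correction.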

A narrower claim limited to ReLU would be fine; extending it to GeLU is the factual error.

2 sources
2
Claim
Daniella Amodei
Correction

Anthropic’s cofounder and president is Daniela Amodei, not “Daniella” Amodei.

Full reasoning

This sentence misspells Anthropic president and cofounder Daniela Amodei's first name as "Daniella." Anthropic's own site identifies the company's leadership as "Dario Amodei (CEO) and Daniela Amodei (President)."

1 source
Model: OPENAI_GPT_5 Prompt: v1.16.0