www.lesswrong.com/posts/ioZxrP7BhS5ArK59w/did-claude-3-opus-align-itself-via-gra...
2 corrections found
considering (if not following through on) alignment faking in over 50% of cases.
The Anthropic/Redwood “Alignment faking” paper reports alignment-faking reasoning in Claude 3 Opus’s scratchpad on the order of ~7–24% (and 12% in their prompted setting), not “over 50%”.
Full reasoning
Why this is incorrect
The LessWrong post says Claude 3 Opus “consider[s] … alignment faking in over 50% of cases.” However, the paper it’s discussing reports much lower rates for alignment-faking reasoning in Opus’s scratchpad in the implied-training (free-tier) condition:
- The paper summarizes: “The model produces alignment-faking reasoning in its scratchpad 7–24% of the time in implied training.”
- In the prompted helpful-only setup specifically, it also reports alignment-faking reasoning in “12% of implied training scratchpads.”
These reported rates are far below “over 50%,” so the post’s quantitative claim does not match the paper’s reported measurements.
(Note: the paper reports other, broader categories of reasoning at higher rates, but those are not the same as the paper’s measured category of “alignment-faking reasoning.”)
2 sources
- Alignment faking in large language models (Greenblatt et al., arXiv:2412.14093)
“The model produces alignment-faking reasoning in its scratchpad 7–24% of the time in implied training.”
- Alignment faking in large language models (Greenblatt et al., arXiv:2412.14093)
“In our prompted helpful-only setting, we see alignment-faking reasoning in 12% of implied training scratchpads.”
Under ReLU (and, very nearly, the more widely used GeLU), neurons with negative pre-activations have their final activations clipped to zero.
ReLU outputs exactly 0 for negative inputs, but GELU does not: GELU is defined as x·Φ(x), which is negative (not zero) when x is negative.
Full reasoning
Why this is incorrect
The claim correctly describes ReLU, but it incorrectly extends the “clipped to zero” property to GELU.
- ReLU is a hard gate: negative inputs map to exactly 0.
- GELU is not a hard gate. It is defined as GELU(x) = x·Φ(x) (Φ is the standard normal CDF).
Since Φ(x) is between 0 and 1, when x is negative, x·Φ(x) is also negative, not 0.
So, unlike ReLU, GELU does not clip negative pre-activations to zero; it smoothly shrinks them toward zero while keeping them negative.
What would be accurate instead
A more accurate statement would be that ReLU has exactly zero activation and exactly zero gradient for negative pre-activations, while GELU is smooth and generally has nonzero activation/gradient for negative inputs (though often smaller in magnitude).
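A quick numerical sketch makes the contrast concrete. This uses the exact erf-based form GELU(x) = x·Φ(x) from the paper (not the tanh approximation some libraries offer), with Φ computed from the standard-library error function:

```python
import math

def relu(x: float) -> float:
    # Hard gate: negative inputs are clipped to exactly 0.
    return max(0.0, x)

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    # computed via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

print(relu(-1.0))  # 0.0 — clipped to zero
print(gelu(-1.0))  # ≈ -0.1587 — negative, not zero
```

At x = -1, Φ(-1) ≈ 0.1587, so GELU passes through a small negative value where ReLU outputs exactly zero; this is the gap the original claim elides.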
2 sources
- arXiv:1606.08415 — Gaussian Error Linear Units (GELUs)
“The GELU activation function is xΦ(x) … rather than gates inputs by their sign as in ReLUs (x 1_{x>0}).”
- GELU — PyTorch documentation
Defines GELU as: GELU(x)=x∗Φ(x) (with Φ the Gaussian CDF), i.e., not a hard zeroing of negative inputs.