All corrections
LessWrong February 24, 2026 at 07:13 PM

www.lesswrong.com/posts/ioZxrP7BhS5ArK59w/did-claude-3-opus-align-itself-via-gra...

2 corrections found

1
Claim
considering (if not following through on) alignment faking in over 50% of cases.
Correction

The Anthropic/Redwood “Alignment faking” paper reports alignment-faking reasoning in Claude 3 Opus’s scratchpad at rates of roughly 7–24% (12% in the prompted helpful-only setting), not “over 50%”.

Full reasoning

Why this is incorrect

The LessWrong post says Claude 3 Opus “consider[s] … alignment faking in over 50% of cases.” However, the paper it’s discussing reports much lower rates for alignment-faking reasoning in Opus’s scratchpad in the implied-training (free-tier) condition:

  • The paper summarizes: “The model produces alignment-faking reasoning in its scratchpad 7–24% of the time in implied training.”
  • In the prompted helpful-only setup specifically, it also reports alignment-faking reasoning in “12% of implied training scratchpads.”

These rates are far below “over 50%,” so the post’s quantitative claim does not match the paper’s measurements.

(Note: the paper reports other, broader categories of reasoning at higher rates, but those are not the same as the paper’s measured category of “alignment-faking reasoning.”)

2
Claim
Under ReLU (and, very nearly, the more widely used GeLU), neurons with negative pre-activations have their final activations clipped to zero.
Correction

ReLU outputs exactly 0 for negative inputs, but GELU does not: GELU is defined as x·Φ(x), which is negative (not zero) when x is negative.

Full reasoning

Why this is incorrect

The claim correctly describes ReLU, but it incorrectly extends the “clipped to zero” property to GELU.

  • ReLU is a hard gate: negative inputs map to exactly 0.
  • GELU is not a hard gate. It is defined as GELU(x) = x·Φ(x) (Φ is the standard normal CDF). (arxiv.org)
    Since Φ(x) lies strictly between 0 and 1, x·Φ(x) is negative (not 0) whenever x is negative.

So, unlike ReLU, GELU does not “clip” negative pre-activations to zero; it smoothly attenuates them while leaving them (slightly) negative.
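
A minimal numeric check (a sketch in plain Python, using the exact erf-based GELU, x·Φ(x), rather than the common tanh approximation; the function names are illustrative):

  import math

  def relu(x: float) -> float:
      # Hard gate: negative inputs map to exactly 0.
      return max(0.0, x)

  def gelu(x: float) -> float:
      # Exact GELU: x * Phi(x), with Phi computed via the error function:
      # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
      return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

  for x in (-3.0, -1.0, -0.5):
      print(f"x = {x:5.1f}  relu = {relu(x):.4f}  gelu = {gelu(x):+.4f}")
  # ReLU prints 0.0000 for all three inputs, while GELU is small but
  # negative, e.g. gelu(-1.0) ≈ -0.1587.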

What would be accurate instead

A more accurate statement would be that ReLU has exactly zero activation and exactly zero gradient for negative pre-activations, while GELU is smooth and generally has a nonzero activation and gradient for negative inputs (though often small in magnitude).
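
To make the gradient contrast concrete, here is a small sketch (not from the post or the paper) using the exact derivative of GELU, d/dx[x·Φ(x)] = Φ(x) + x·φ(x), where φ is the standard normal pdf:

  import math

  def relu_grad(x: float) -> float:
      # ReLU's derivative: 0 for x < 0, 1 for x > 0 (taken as 0 at x = 0 here).
      return 1.0 if x > 0 else 0.0

  def gelu_grad(x: float) -> float:
      # d/dx [x * Phi(x)] = Phi(x) + x * phi(x).
      Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
      phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
      return Phi + x * phi

  print(relu_grad(-1.0))  # 0.0: ReLU passes no gradient for negative inputs
  print(gelu_grad(-1.0))  # ≈ -0.0833: GELU's gradient is small but nonzero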

Model: OPENAI_GPT_5 Prompt: v1.6.0