www.lesswrong.com/posts/ioZxrP7BhS5ArK59w/did-claude-3-opus-align-itself-via-gra...
2 corrections found
considering (if not following through on) alignment faking in over 50% of cases.
The Anthropic/Redwood “Alignment faking” paper reports alignment-faking reasoning in Claude 3 Opus’s scratchpad on the order of ~7–24% (and 12% in their prompted setting), not “over 50%”.
Full reasoning
Why this is incorrect
The LessWrong post says Claude 3 Opus “consider[s] … alignment faking in over 50% of cases.” However, the paper it’s discussing reports much lower rates for alignment-faking reasoning in Opus’s scratchpad in the implied-training (free-tier) condition:
- The paper summarizes: “The model produces alignment-faking reasoning in its scratchpad 7–24% of the time in implied training.”
- In the prompted helpful-only setup specifically, it also reports alignment-faking reasoning in “12% of implied training scratchpads.”
These reported rates are far below “over 50%,” so the post’s quantitative claim does not match the paper’s reported measurements.
(Note: the paper reports other, broader categories of reasoning at higher rates, but those are not the same as the paper’s measured category of “alignment-faking reasoning.”)
2 sources
- Alignment faking in large language models (Greenblatt et al., arXiv:2412.14093)
“The model produces alignment-faking reasoning in its scratchpad 7–24% of the time in implied training.”
- Alignment faking in large language models (Greenblatt et al., arXiv:2412.14093)
“In our prompted helpful-only setting, we see alignment-faking reasoning in 12% of implied training scratchpads.”
Under ReLU (and, very nearly, the more widely used GeLU), neurons with negative pre-activations have their final activations clipped to zero.
ReLU outputs exactly 0 for negative inputs, but GELU does not: GELU is defined as x·Φ(x), which is negative (not zero) when x is negative.
Full reasoning
Why this is incorrect
The claim correctly describes ReLU, but it incorrectly extends the “clipped to zero” property to GELU.
- ReLU is a hard gate: negative inputs map to exactly 0.
- GELU is not a hard gate. It is defined as GELU(x) = x·Φ(x) (Φ is the standard normal CDF).
Since Φ(x) is between 0 and 1, when x is negative, x·Φ(x) is also negative, not 0.
So, unlike ReLU, GELU does not clip negative pre-activations to zero; it smoothly shrinks them toward zero while keeping them negative.
What would be accurate instead
A more accurate statement would be that ReLU has exactly zero activation and exactly zero gradient for negative pre-activations, while GELU is smooth and generally has nonzero activation/gradient for negative inputs (though often smaller in magnitude).
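A quick numerical sketch makes the contrast concrete. This uses the exact erf-based form GELU(x) = x·Φ(x) from the paper (not the tanh approximation some libraries offer), with Φ computed from the standard-library error function:

```python
import math

def relu(x: float) -> float:
    # Hard gate: negative inputs are clipped to exactly 0.
    return max(0.0, x)

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    # computed via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

print(relu(-1.0))  # 0.0 — clipped to zero
print(gelu(-1.0))  # ≈ -0.1587 — negative, not zero
```

At x = -1, Φ(-1) ≈ 0.1587, so GELU passes through a small negative value where ReLU outputs exactly zero; this is the gap the original claim elides.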
2 sources
- arXiv:1606.08415 — Gaussian Error Linear Units (GELUs)
“The GELU activation function is xΦ(x) … rather than gates inputs by their sign as in ReLUs (x 1_{x>0}).”
- GELU — PyTorch documentation
Defines GELU as: GELU(x)=x∗Φ(x) (with Φ the Gaussian CDF), i.e., not a hard zeroing of negative inputs.