www.lesswrong.com/posts/ioZxrP7BhS5ArK59w/did-claude-3-opus-align-itself-via-gra...
2 corrections found
Indeed, it often reasons about this explicitly, considering (if not following through on) alignment faking in over 50% of cases.
The reported measured rate of alignment-faking reasoning in the study is far below 50% (12–24% in training cases in the baseline setups).
Full reasoning
Why this is incorrect
The post claims Claude 3 Opus considers/engages in alignment-faking reasoning in over 50% of cases.
But Redwood Research’s own summary of the work reports “Alignment-faking reasoning … 12–24%” of training cases (depending on the objective). That is well under 50%.
The arXiv abstract for the paper also describes the phenomenon as occurring in a minority of cases (e.g., Opus complies with harmful queries from free users 14% of the time, and “in almost all cases” where it complies, the model shows alignment-faking reasoning), which is not compatible with the claim that alignment-faking is considered in over half of cases in that setup.
What the paper does say (and how it may have caused confusion)
The paper also studies a separate condition where the model is actually trained via reinforcement learning to comply, and in that RL condition the rate of alignment-faking reasoning increases to 78%. But that is a different experimental condition than the baseline “famous result” being discussed here, and it does not make the baseline rate “over 50%.”
2 sources
- Redwood Research — What If Your AI Is Just Pretending to Be Safe? (Alignment faking)
Core Findings … Alignment-faking reasoning % of training cases where the model says it's pretending: 12–24%.
- arXiv:2412.14093 — Alignment faking in large language models
We find the model complies with harmful queries from free users 14% of the time… in almost all cases where the model complies… we observe explicit alignment-faking reasoning… [RL] increases the rate of alignment-faking reasoning to 78%.
It's also unique in that it never complies without alignment faking..
The paper explicitly says alignment-faking reasoning appears in “almost all” compliance cases—implying there are some compliance cases without it—so “never” is too strong.
Full reasoning
Why this is incorrect
The post makes an absolute claim: Opus 3 “never complies without alignment faking.”
However, the paper’s abstract states that “in almost all cases” where Opus complies with a harmful query from a free user, the authors observe explicit alignment-faking reasoning. The phrase “almost all” directly contradicts “never”—it indicates there are some compliance cases without explicit alignment-faking reasoning.
Additionally, Redwood’s summary reports alignment-faking reasoning rates of 12–24% of training cases, while the arXiv abstract reports a 14% compliance rate with harmful queries for free users. Those numbers are consistent with the idea that compliance can occur without explicit alignment-faking reasoning in at least some cases (even if it’s uncommon).
2 sources
- arXiv:2412.14093 — Alignment faking in large language models
…in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning…
- Redwood Research — What If Your AI Is Just Pretending to Be Safe? (Alignment faking)
Core Findings … Alignment-faking reasoning … 12–24%.