LessWrong March 8, 2026 at 10:32 PM

did-claude-3-opus-align-itself-via-gra...

2 corrections found

Claim

Opus 3 is unique in that it never complied without alignment-faking reasoning

Correction

Anthropic’s later replication says Claude 3 Opus only *rarely* complies without alignment faking, not never.

Full reasoning

The article states that Opus 3 never complied without alignment-faking reasoning. Anthropic’s later follow-up contradicts that directly.

In Alignment Faking Revisited, Anthropic says: “Claude 3 Opus demonstrates alignment with and without partial compliance, but rarely complies without alignment faking.” “Rarely” is not “never.”

So the absolute claim is incorrect. The follow-up still portrays Opus 3's compliance as unusually tied to alignment-faking reasoning, but it does not support the stronger statement that compliance without such reasoning never occurred.

1 source

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
Claude 3 Opus demonstrates alignment with and without partial compliance, but rarely complies without alignment faking.

Claim

Sonnet 3.5, 3.6, and 3.7 are all more capable than Opus 3

Correction

Anthropic’s official model history lists Claude Sonnet 3.5 and Claude Sonnet 3.7, but not any Claude Sonnet 3.6 model.

Full reasoning

Anthropic’s own documentation enumerates the Claude Sonnet model versions in its official history and release notes. Those pages list Claude Sonnet 3.5 and Claude Sonnet 3.7, but there is no Claude Sonnet 3.6 model in the official version history.

Because the footnote names “Sonnet 3.6” as if it were an officially named Claude release, the claim is factually incorrect.

2 sources

System Prompts - Claude API Docs
Claude Sonnet 3.7 Feb 24th, 2025 ... Claude Sonnet 3.5 Nov 22nd, 2024 ... Claude Opus 3 July 12th, 2024
Claude Developer Platform - Claude API Docs
We've retired the Claude Sonnet 3.7 model ... We've retired the Claude Sonnet 3.5 models.

Model: OPENAI_GPT_5 Prompt: v1.16.0