www.lesswrong.com/posts/ioZxrP7BhS5ArK59w/did-claude-3-opus-align-itself-via-gra...
2 corrections found
Opus 3 is unique in that it never complied without alignment-faking reasoning
Anthropic’s later replication says Claude 3 Opus only *rarely* complies without alignment-faking reasoning, not never.
Full reasoning
The article states that Opus 3 never complied without alignment-faking reasoning. Anthropic’s later follow-up contradicts that directly.
In Alignment Faking Revisited, Anthropic says: “Claude 3 Opus demonstrates alignment with and without partial compliance, but rarely complies without alignment faking.” “Rarely” is not “never.”
So the absolute claim is incorrect. The follow-up still portrays Opus as a model whose compliance is unusually tied to alignment-faking reasoning, but it does not support the stronger statement that compliance without such reasoning never occurred.
1 source
- Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
Claude 3 Opus demonstrates alignment with and without partial compliance, but rarely complies without alignment faking.
Sonnet 3.5, 3.6, and 3.7 are all more capable than Opus 3
Anthropic’s official model history lists Claude Sonnet 3.5 and Claude Sonnet 3.7, but not any Claude Sonnet 3.6 model.
Full reasoning
Anthropic’s own documentation enumerates the Claude Sonnet model versions in its official version history and release notes. Those pages list Claude Sonnet 3.5 and Claude Sonnet 3.7, but no Claude Sonnet 3.6.
Because the footnote names “Sonnet 3.6” as if it were an official Claude release, and no model by that name appears in Anthropic’s version history, the claim is factually incorrect.
2 sources
- System Prompts - Claude API Docs
Claude Sonnet 3.7 Feb 24th, 2025 ... Claude Sonnet 3.5 Nov 22nd, 2024 ... Claude Opus 3 July 12th, 2024
- Claude Developer Platform - Claude API Docs
We've retired the Claude Sonnet 3.7 model ... We've retired the Claude Sonnet 3.5 models.