x.com/AnthropicAI/status/2052808791301697563
1 correction found
Our post-training at the time wasn't making it worse-but it also wasn't making it better.
Anthropic’s own follow-up materials say post-training was improving alignment, not merely failing to help. Their May 8, 2026 research post says alignment improved in the baseline run and continued improving during RL post-training.
Full reasoning
Anthropic’s longer write-up for this work does not say post-training was neutral. In the accompanying May 8, 2026 post “Teaching Claude Why,” Anthropic writes that improvements to the pretraining prior and alignment-specific data “do not degrade during RL post-training. In fact, alignment continues to improve,” and adds that “alignment improves in the baseline run.” That directly contradicts the claim that post-training “wasn’t making it better.”
Anthropic’s Summer 2025 Pilot Sabotage Risk Report makes a similar point: Claude Opus 4’s “alignment properties were rapidly improving at the end of training without having leveled off,” and the report adds that further training would likely improve its alignment. So Anthropic’s own published materials describe post-training as improving alignment, even if it did not improve it enough to solve the blackmail issue.
A more accurate summary would be that Anthropic now attributes the blackmail behavior largely to gaps in safety-training coverage and fallback to the pretraining prior, not that post-training was simply doing nothing.
2 sources
- Teaching Claude Why
Anthropic writes: “Improvements that we make to the PT prior and alignment-specific data do not degrade during RL post-training. In fact, alignment continues to improve. This (and the fact that alignment improves in the baseline run) indicates to us that hypotheses 1 and 2 above are not the source of the issue.”
- Anthropic’s Summer 2025 Pilot Sabotage Risk Report
The report states: “Alignment-oriented training environments were a small part of the training of Claude Opus 4, and the model’s alignment properties were rapidly improving at the end of training without having leveled off. The better alignment properties of Claude Sonnet 4, which saw more such training, also suggests that further training would likely improve its alignment.”