x.com/fleetingbits/status/2063133070132429110
1 correction found
but apparently sonnet 4 was preferred 50% of the time
Anthropic’s writeup does not attribute this result to Sonnet 4. It says Anthropic’s best model in November 2025 was Opus 4.5, which beat the human choice 51% of the time.
Full reasoning
Anthropic’s own article gives the comparison explicitly: in this evaluation, "our best model in November 2025 (Opus 4.5) beat the human choice 51% of the time; in April 2026 (Mythos Preview), this grew to 64%." That directly contradicts the post’s claim that Sonnet 4 was preferred 50% of the time.
So the issue is more than minor rounding. The post names the wrong model: Anthropic attributes the earlier result to Opus 4.5, not Sonnet 4, and states the figure as 51%, not 50%.
1 source
- When AI builds itself | Anthropic
On this measure, our best model in November 2025 (Opus 4.5) beat the human choice 51% of the time; in April 2026 (Mythos Preview), this grew to 64%.