X June 7, 2026 at 04:56 PM

x.com/fleetingbits/status/2063133070132429110

1 correction found

1
Claim
but apparently sonnet 4 was preferred 50% of the time
Correction

Anthropic’s writeup does not say Sonnet 4 got this result. It says their best model in November 2025 was Opus 4.5, which beat the human choice 51% of the time.

Full reasoning

Anthropic’s own article gives the comparison explicitly: in this evaluation, "our best model in November 2025 (Opus 4.5) beat the human choice 51% of the time; in April 2026 (Mythos Preview), this grew to 64%." That directly contradicts the post’s claim that Sonnet 4 was preferred 50% of the time.

The discrepancy is more than a rounding difference. The post names the wrong model: Anthropic attributes the earlier result to Opus 4.5, not Sonnet 4, and gives the figure as 51%, not 50%.

1 source
  • When AI builds itself | Anthropic

    On this measure, our best model in November 2025 (Opus 4.5) beat the human choice 51% of the time; in April 2026 (Mythos Preview), this grew to 64%.

Model: OPENAI_GPT_5 Prompt: v1.16.0