LessWrong May 26, 2026 at 11:27 AM

recent-ai-model-progress-feels-mostly-...

1 correction found

Claim

But every single benchmark that OpenAI and Anthropic have accompanied their releases with has had a test dataset publicly available.

Correction

This overstates how public the release benchmarks were. Anthropic release materials included at least one explicitly internal evaluation and also discussed SWE-bench results that rely on hidden tests.

Full reasoning

This sentence is incorrect as written because there are clear counterexamples in Anthropic's own release materials.

Anthropic's Claude 3.5 Sonnet launch post cites an "internal agentic coding evaluation." An internal evaluation is not a benchmark with a publicly available test dataset. In the June 21, 2024 Claude 3.5 Sonnet announcement, Anthropic wrote: "In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems..." That alone contradicts the claim that every single benchmark accompanying releases had a public test dataset.
Anthropic's SWE-bench release writeup explicitly says the model is graded on hidden tests. In Anthropic's January 6, 2025 post on Claude 3.5 Sonnet's SWE-bench performance, the company lists "Hidden tests" as one of the benchmark's challenges and explains that "the model cannot see the tests it's being graded against." So even for a widely discussed release benchmark, the tests used for grading were withheld from the model during evaluation.

Because a single counterexample is enough to falsify an "every single benchmark" claim, these two official Anthropic sources are sufficient to show the statement is too broad and factually wrong.

2 sources

Introducing Claude 3.5 Sonnet | Anthropic
In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%.
Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet | Anthropic
Hidden tests. Because the model cannot see the tests it's being graded against, it often "thinks" that it has succeeded when the task actually is a failure.

Model: OPENAI_GPT_5 Prompt: v1.16.0