www.lesswrong.com/posts/4mvphwx5pdsZLMmpY/recent-ai-model-progress-feels-mostly-...
1 correction found
But every single benchmark that OpenAI and Anthropic have accompanied their releases with has had a test dataset publicly available.
This overstates how public the release benchmarks were. Anthropic release materials included at least one explicitly internal evaluation and also discussed SWE-bench results that rely on hidden tests.
Full reasoning
This sentence is incorrect as written because there are clear counterexamples in Anthropic's own release materials.
- Anthropic's Claude 3.5 Sonnet launch post cites an "internal agentic coding evaluation." An internal evaluation is not a benchmark with a publicly available test dataset. In the June 21, 2024 Claude 3.5 Sonnet announcement, Anthropic wrote: "In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems..." That alone contradicts the claim that every single benchmark accompanying releases had a public test dataset.
- Anthropic's SWE-bench release writeup explicitly says the model is graded on hidden tests. In Anthropic's Jan. 6, 2025 post on Claude 3.5 Sonnet's SWE-bench performance, the company lists "Hidden tests" as one of the benchmark's challenges and explains that "the model cannot see the tests it's being graded against." So even for a widely discussed release benchmark, the tests used for grading were hidden from the model being evaluated.
Because a single counterexample is enough to falsify an "every single benchmark" claim, these two official Anthropic sources are sufficient to show the statement is too broad and factually wrong.
2 sources
- Introducing Claude 3.5 Sonnet | Anthropic
In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%.
- Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet | Anthropic
Hidden tests. Because the model cannot see the tests it's being graded against, it often "thinks" that it has succeeded when the task actually is a failure.