www.lesswrong.com/posts/4mvphwx5pdsZLMmpY/recent-ai-model-progress-feels-mostly-...
1 correction found
But every single benchmark that OpenAI and Anthropic have accompanied their releases with has had a test dataset publicly available.
This is incorrect: OpenAI has publicized results on FrontierMath, whose problems/solutions are largely private and not publicly released.
Full reasoning
The post claims that whenever OpenAI/Anthropic release models and publish benchmark results, the benchmark test set is publicly available.
However, OpenAI’s own release materials include FrontierMath, and FrontierMath’s test set is not publicly available in full:
- OpenAI used FrontierMath in a model release. In OpenAI's "OpenAI o3‑mini" release (dated January 31, 2025), OpenAI reports performance on "FrontierMath" as a featured evaluation/benchmark.
- FrontierMath's dataset is mostly private. Epoch AI's FrontierMath benchmark hub explicitly states that while the full dataset is 350 problems, only a small number have been made public, and "The remaining 290 problems make up … private" sets; it also notes that, unless otherwise specified, reported numbers correspond to evaluations on the private sets.
- Epoch AI also states it cannot share the questions/answers without OpenAI's permission. In Epoch AI's clarification post (published Jan 23, 2025), Epoch AI explains that OpenAI commissioned the problems and that Epoch AI cannot share the questions and answers with other parties without written permission from OpenAI.
Taken together: OpenAI publicized benchmark results using FrontierMath in early 2025, but FrontierMath’s test set is largely private, so it is not true that "every single" benchmark used in OpenAI/Anthropic releases has a publicly available test dataset.
3 sources
- OpenAI o3‑mini | OpenAI (Jan 31, 2025)
OpenAI's o3‑mini release reports results on "FrontierMath" (research-level mathematics), including claims about the percentage of FrontierMath problems solved.
- FrontierMath | Epoch AI
Epoch AI states the full dataset has 350 problems; it made 10 public (tiers 1–3), while "The remaining 290 problems make up … private" sets, and the scores shown generally correspond to the private sets.
- Clarifying the creation and use of the FrontierMath benchmark | Epoch AI (Jan 23, 2025)
Epoch AI says OpenAI commissioned the FrontierMath problems and that Epoch AI “cannot share the questions and answers with other parties without written permission from OpenAI,” describing restricted access/ownership and a holdout set.