x.com/JFPuget/status/2041819349984448848
2 corrections found
The only exception to the latter are benchmarks where the evaluation data is not available publicly, like @kaggle competitions, or ARC AGI benchmarks.
This is inaccurate for ARC-AGI: the official ARC-AGI benchmark pages list public evaluation sets for both ARC-AGI-1 and ARC-AGI-2. ARC-AGI also has semi-private/private eval sets, but it is not a benchmark whose evaluation data is simply unavailable publicly.
Full reasoning
Official ARC Prize documentation contradicts the claim that the ARC-AGI benchmarks lack publicly available evaluation data.
- The ARC-AGI-1 page says the dataset includes a "Public Eval Set" of 400 tasks, alongside separate semi-private and private eval sets.
- The ARC-AGI-2 page likewise says the dataset includes a "Public Eval Set" of 120 tasks, and separately lists semi-private and private eval sets.
So ARC-AGI is not an example of a benchmark with no public evaluation data. A more accurate description would be that ARC-AGI has both public and non-public evaluation splits.
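For context, the ARC-AGI-1 public eval set is not just documented but distributed as plain JSON task files (e.g., in the fchollet/ARC-AGI GitHub repository), so anyone can load and inspect it. Below is a minimal sketch, assuming a local clone of that repo; the directory path and helper name are illustrative:

```python
import json
from pathlib import Path

# Minimal sketch: load the ARC-AGI-1 public eval set.
# Assumes the fchollet/ARC-AGI repo is cloned locally and that
# data/evaluation holds the public eval tasks as one JSON file each.
EVAL_DIR = Path("ARC-AGI/data/evaluation")  # hypothetical local path

def load_tasks(eval_dir: Path) -> dict:
    """Return {task_id: task} for every JSON task file in eval_dir."""
    tasks = {}
    for path in sorted(eval_dir.glob("*.json")):
        with path.open() as f:
            tasks[path.stem] = json.load(f)
    return tasks

tasks = load_tasks(EVAL_DIR)
print(f"{len(tasks)} public eval tasks")  # expected: 400 for ARC-AGI-1

# Each task holds demonstration pairs ("train") and held-out pairs ("test");
# grids are lists of lists of ints in 0..9.
example = next(iter(tasks.values()))
print(len(example["train"]), "demo pairs,", len(example["test"]), "test pairs")
```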
2 sources
- ARC-AGI-1
From the dataset table (Dataset / Tasks / Description): Public Eval Set, 400 tasks, "Used to evaluate your final algorithm"; Semi-Private Eval Set, 100 tasks ...; Private Eval Set, 100 tasks
- ARC-AGI-2
From the dataset structure table: Public Eval Set, 120 tasks, "Calibrated, public" ...; Semi-Private Eval Set, 120 tasks, "Calibrated, not public" ...; Private Eval Set, 120 tasks, "Calibrated, not public"
These benchmarks measure memorization, nothing else.
This is too absolute. Published contamination studies do not support the claim that older public benchmarks measure only memorization; on several major benchmarks, scores changed little after potentially leaked examples were removed, and audits have found little evidence of pervasive contamination.
Full reasoning
The claim says pre-existing public benchmarks measure "memorization, nothing else." That overstates what the evidence shows.
Two primary-source papers directly contradict this blanket statement:
- GPT-3 contamination analysis (NeurIPS 2020): Brown et al. explicitly created cleaned versions of benchmarks by removing potentially leaked examples from the pretraining corpus. They report that "In most cases performance changes only negligibly" and that they saw "no evidence that contamination level and performance difference are correlated." They conclude that contamination either was overestimated or "has little effect on performance." If benchmark scores remain similar after removing suspected leaked items, those benchmarks are not measuring only memorization. (A minimal sketch of this clean-vs-full comparison follows the list.)
- Black-box contamination audit (Oren et al., 2023): This paper develops a method to prove test-set contamination and then audits five public models, finding "little evidence for pervasive contamination." That again contradicts the idea that all older public benchmarks are just memorization tests. (The idea behind the test is sketched after the next paragraph.)
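The Brown et al. check is simple to express in code. A minimal sketch, assuming per-example correctness flags from the model under test and a precomputed set of training-corpus n-grams; the 13-gram overlap rule follows the paper, but the function names and data shapes here are illustrative:

```python
# Sketch of the decontamination check from Brown et al. (2020):
# flag test examples that overlap the training corpus, then compare the
# metric on the full test set against the "clean" (non-flagged) subset.

def ngrams(text: str, n: int = 13) -> set:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(example_text: str, train_ngrams: set, n: int = 13) -> bool:
    """An example is flagged if it shares any n-gram with the training corpus."""
    return not ngrams(example_text, n).isdisjoint(train_ngrams)

def clean_vs_full(examples, correct, train_ngrams):
    """examples: list[str]; correct: list[bool] (model answered correctly)."""
    flagged = [is_contaminated(x, train_ngrams) for x in examples]
    full_acc = sum(correct) / len(correct)
    clean = [c for c, f in zip(correct, flagged) if not f]
    clean_acc = sum(clean) / len(clean) if clean else float("nan")
    return full_acc, clean_acc

# If clean_acc tracks full_acc, contamination (even if present) is not
# driving the benchmark score: the paper's central observation.
```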
This does not mean contamination never matters. It does. But the universal claim here — that such benchmarks measure memorization and nothing else — is not supported by the published evidence and is contradicted by contamination analyses showing many benchmark results survive decontamination checks with only small changes.
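The Oren et al. audit can be sketched similarly. Under the null hypothesis of no contamination the test set is exchangeable, so the model should assign the published ordering no more likelihood than a random shuffle; their paper uses a sharded variant of this statistic for power. In the simplified sketch below, `seq_logprob` is a caller-supplied, hypothetical hook into the audited model, not a real API:

```python
import random

# Simplified permutation version of the exchangeability test behind
# Oren et al. (2023): if the canonical (published) ordering of the test
# set scores as an outlier among random shufflings, that is evidence the
# model saw the test set in its published order.

def contamination_p_value(examples, seq_logprob, num_shuffles=99, seed=0):
    """Permutation p-value for 'canonical order is unusually likely'."""
    rng = random.Random(seed)
    canonical = seq_logprob(examples)
    hits = 0
    for _ in range(num_shuffles):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if seq_logprob(shuffled) >= canonical:
            hits += 1
    # Small p-value: almost no shuffle scores as high as the published order.
    return (1 + hits) / (1 + num_shuffles)
```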
2 sources
- Language Models are Few-Shot Learners
For each benchmark, we produce a 'clean' version ... If the score on the clean subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a significant effect on reported results. In most cases performance changes only negligibly ... We conclude that either our conservative method substantially overestimated contamination or that contamination has little effect on performance.
- Proving Test Set Contamination in Black Box Language Models
Using our test, we audit five popular publicly accessible language models for test set contamination and find little evidence for pervasive contamination.