x.com/bindureddy/status/2042001592027877708
1 correction found
On SWE-BENCH pro it score 99.99
That exact SWE-Bench Pro score is not possible. SWE-Bench Pro has 1,865 problems, so reported Pass@1 scores move in steps of about 0.0536 percentage points; 99.99% cannot be produced from that benchmark.
Full reasoning
The exact figure 99.99 is incompatible with how SWE-Bench Pro is defined and scored.
- The official SWE-Bench Pro paper says the benchmark contains 1,865 problems.
- Scale's official page describes model performance on SWE-Bench Pro as Pass@1, i.e. the percentage of benchmark problems solved.
Because the benchmark has 1,865 discrete problems, the score can only change in increments of 100 / 1865 ≈ 0.0536 percentage points per problem. That means an exact 99.99% score is not attainable.
Near the top end, the only possible outcomes are:
- 1864 / 1865 = 99.946...%, which rounds to 99.95%
- 1865 / 1865 = 100.00%
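The granularity argument above can be checked directly. As a sketch (assuming reported scores are rounded to two decimal places), enumerate every attainable Pass@1 value on a 1,865-problem benchmark and test whether 99.99 is among them:

```python
# Sketch: attainable Pass@1 scores on a 1,865-problem benchmark are k/N * 100
# for integer k, so the set of reportable (2-decimal) scores is finite.
N = 1865  # number of SWE-Bench Pro problems

attainable = {round(100 * k / N, 2) for k in range(N + 1)}

print(99.99 in attainable)        # False: no problem count yields 99.99%
print(round(100 * 1864 / N, 2))  # 99.95: solving all but one problem
print(round(100 / N, 4))         # 0.0536: step size per problem, in points
```

Solving 1,864 of 1,865 problems gives 99.95% after rounding, and solving all 1,865 gives 100.00%; nothing in between can produce 99.99%.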
So the post's quoted SWE-Bench Pro number is not just unlikely; it is mathematically inconsistent with the benchmark's published size and scoring format.
As extra context, Scale's published evaluation page reports the best score in its unified evaluation as 23.3%, far below 99.99%. The key correction, though, is the stronger one: 99.99 is not an attainable SWE-Bench Pro Pass@1 score at all.
2 sources
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
SWE-BENCH PRO contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories...
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? - Scale Labs
In our evaluation of widely used coding models, under a unified scaffold, we observe that their performance on SWE-BENCH PRO remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%.