x.com/bindureddy/status/2042001592027877708
1 correction found
On SWE-BENCH pro it score 99.99
That exact SWE-Bench Pro score is not possible. SWE-Bench Pro has 1,865 problems, so reported Pass@1 scores move in steps of about 0.0536 percentage points; 99.99% cannot be produced from that benchmark.
Full reasoning
The exact figure 99.99 is incompatible with how SWE-Bench Pro is defined and scored.
- The official SWE-Bench Pro paper says the benchmark contains 1,865 problems.
- Scale's official page describes model performance on SWE-Bench Pro as Pass@1, i.e. the percentage of benchmark problems solved.
Because the benchmark has 1,865 discrete problems, the score can only change in increments of 100 / 1865 ≈ 0.0536 percentage points per problem. That means an exact 99.99% score is not attainable.
Near the top end, the only possible outcomes are:
- 1864 / 1865 = 99.946...%, which rounds to 99.95%
- 1865 / 1865 = 100.00%
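The granularity argument above can be checked directly. As a sketch (assuming reported scores are rounded to two decimal places), enumerate every attainable Pass@1 value on a 1,865-problem benchmark and test whether 99.99 is among them:

```python
# Sketch: attainable Pass@1 scores on a 1,865-problem benchmark are k/N * 100
# for integer k, so the set of reportable (2-decimal) scores is finite.
N = 1865  # number of SWE-Bench Pro problems

attainable = {round(100 * k / N, 2) for k in range(N + 1)}

print(99.99 in attainable)        # False: no problem count yields 99.99%
print(round(100 * 1864 / N, 2))  # 99.95: solving all but one problem
print(round(100 / N, 4))         # 0.0536: step size per problem, in points
```

Solving 1,864 of 1,865 problems gives 99.95% after rounding, and solving all 1,865 gives 100.00%; nothing in between can produce 99.99%.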
So the post's quoted SWE-Bench Pro number is not just unlikely; it is mathematically inconsistent with the benchmark's published size and scoring format.
As extra context, Scale's published evaluation page reports the best score in its unified evaluation as 23.3%, far below 99.99%. The key correction, though, is the stronger one: 99.99 is not an attainable SWE-Bench Pro Pass@1 score at all.
2 sources
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
SWE-BENCH PRO contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories...
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? - Scale Labs
In our evaluation of widely used coding models, under a unified scaffold, we observe that their performance on SWE-BENCH PRO remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%.