LessWrong April 7, 2026 at 04:49 PM

we-re-actually-running-out-of-benchmar...

1 correction found

Claim

Claude Opus 4.6 has a 50% time horizon of 12 hours, but a 95% upper confidence bound of 60 hours.

Correction

METR’s current published figures are much higher: after a March 3, 2026 correction, Claude Opus 4.6’s 50% time horizon is about 719 hours, with an upper confidence bound around 3,950 hours, not 12 and 60.

Full reasoning

METR’s own time-horizons page says it was last updated on March 3, 2026 and notes: “Corrected a regularization mistake that affected our measurements.” In the current raw data file for Time Horizon 1.1, the entry claude_opus_4_6_inspect reports:

p50_horizon_length.estimate: 718.80683
p50_horizon_length.ci_high: 3949.750392
p50_horizon_length.ci_low: 319.32091

So the article’s numbers for Claude Opus 4.6 do not match the figures above, which are METR’s published values as of April 7, 2026. The article may reflect an older, pre-correction measurement. Read as hours, the current public figures are roughly 719 for the 50% horizon and 3,950 for the upper confidence bound, not 12 and 60; the quoted raw-data excerpt does not itself state a unit.
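
The comparison above depends on the unit of p50_horizon_length, which the quoted excerpt does not state. As a sanity check, here is a minimal sketch that converts the three quoted values assuming they are recorded in minutes. That unit is an assumption, not something the excerpt confirms; if the values are already in hours, the conversion does not apply.

```python
# Values quoted above from the claude_opus_4_6_inspect entry.
p50_horizon_length = {
    "estimate": 718.80683,
    "ci_high": 3949.750392,
    "ci_low": 319.32091,
}

# ASSUMPTION: the raw values are in minutes. The quoted excerpt does not say.
MINUTES_PER_HOUR = 60


def minutes_to_hours(minutes: float) -> float:
    """Convert a duration in minutes to hours."""
    return minutes / MINUTES_PER_HOUR


# Convert every quoted field under the minutes assumption.
hours = {key: minutes_to_hours(value) for key, value in p50_horizon_length.items()}

for key, value in hours.items():
    print(f"{key}: {value:.1f} h")
# estimate: 12.0 h
# ci_high: 65.8 h
# ci_low: 5.3 h
```

Under that assumption the estimate comes to about 12 hours and the upper bound to about 66 hours, so the unit is worth confirming against the raw file's documentation before comparing these values with the article's figures.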

2 sources

Task-Completion Time Horizons of Frontier AI Models - METR
These are our most up-to-date measurements... LAST UPDATED March 3, 2026... Updates: March 3rd, 2026: Corrected a regularization mistake that affected our measurements.
METR Time Horizon 1.1 raw data
claude_opus_4_6_inspect ... p50_horizon_length: ci_high: 3949.750392 ci_low: 319.32091 estimate: 718.80683

Model: OPENAI_GPT_5 Prompt: v1.16.0