www.lesswrong.com/posts/gfkJp8Mr9sBm83Rcz/we-re-actually-running-out-of-benchmar...
1 correction found
Claude Opus 4.6 has a 50% time horizon of 12 hours, but a 95% upper confidence bound of 60 hours.
METR’s current published figures are much higher: after a March 3, 2026 correction, Claude Opus 4.6’s 50% time horizon is about 719 hours, with an upper confidence bound around 3,950 hours, not 12 and 60.
Full reasoning
METR’s own time-horizons page says it was last updated on March 3, 2026 and notes: “Corrected a regularization mistake that affected our measurements.” In the current raw data file for Time Horizon 1.1, the entry claude_opus_4_6_inspect reports:
p50_horizon_length.estimate: 718.80683
p50_horizon_length.ci_high: 3949.750392
p50_horizon_length.ci_low: 319.32091
So the article’s numbers for Claude Opus 4.6 are not METR’s up-to-date published values as of April 7, 2026. They appear to reflect an older pre-correction measurement, but the current public METR figures are roughly 719 hours for the 50% horizon and 3,950 hours for the upper confidence bound, not 12 hours and 60 hours.
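The raw-data excerpt quoted above lists the three values without a unit, and the comparison with the article depends on which unit is assumed. As a minimal sketch, the snippet below converts the quoted values to hours under two readings: hours (the reading used in this correction) and minutes (a hypothetical alternative that would need to be ruled out against METR's documentation). The `unit` labels and the helper names `to_hours` and `summarize` are illustrative assumptions, not fields or functions from METR's file.

```python
# Values copied from the raw-data excerpt for claude_opus_4_6_inspect quoted above.
horizon = {
    "estimate": 718.80683,
    "ci_low": 319.32091,
    "ci_high": 3949.750392,
}


def to_hours(value, unit):
    """Convert a horizon length to hours.

    `unit` is an assumed label supplied by the caller; the quoted excerpt
    does not state a unit.
    """
    if unit == "hours":
        return value
    if unit == "minutes":
        return value / 60.0
    raise ValueError(f"unsupported unit: {unit}")


def summarize(values, unit):
    """Return every field converted to hours, rounded to two decimals."""
    return {name: round(to_hours(v, unit), 2) for name, v in values.items()}


if __name__ == "__main__":
    # Print both readings side by side so they can be compared with the
    # article's figures of 12 hours and 60 hours.
    for unit in ("hours", "minutes"):
        print(unit, summarize(horizon, unit))
```

Read as hours, the values stay at roughly 719 and 3,950. Read as minutes, the estimate would come to about 11.98 hours and the upper bound to about 65.83 hours, so the unit of `p50_horizon_length` is worth confirming before comparing against the article's figures.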
2 sources
- Task-Completion Time Horizons of Frontier AI Models - METR
These are our most up-to-date measurements... LAST UPDATED March 3, 2026... Updates: March 3rd, 2026: Corrected a regularization mistake that affected our measurements.
- METR Time Horizon 1.1 raw data
claude_opus_4_6_inspect ... p50_horizon_length: ci_high: 3949.750392 ci_low: 319.32091 estimate: 718.80683