All corrections
LessWrong March 5, 2026 at 07:50 AM

www.lesswrong.com/posts/CJw2tNHaEimx6nwNy/ai-forecasting-one-year-in

1 correction found

1. Claim
MMLU: 67.5% (vs. 57.1% predicted)
Correction

The post reports the MMLU SOTA as 67.5%, but the PaLM technical report (submitted to arXiv in April 2022) reports a higher MMLU score of 69.3% for PaLM 540B, so the SOTA as of June 30, 2022 was at least 69.3%.

Full reasoning

The post frames these numbers as the state-of-the-art (SOTA) results “as of today” (the post date is July 4, 2022, evaluating resolution for June 30, 2022).

However, the PaLM technical report gives a higher MMLU score than 67.5%:

  • In PaLM: Scaling Language Modeling with Pathways (Google Research), Table 6 reports MMLU (5-shot) average = 69.3 for PaLM 540B.
  • The same table also reports Chinchilla 70B at 67.5, explicitly labeled as “Prior SOTA”—i.e., it was the earlier best, and PaLM surpasses it.

Crucially, the arXiv record shows the PaLM report was submitted on April 5, 2022, before June 30, 2022. Therefore, by the contest’s June 30, 2022 resolution date, there was already a published result (PaLM) exceeding 67.5 on MMLU, so it is not correct to state that the SOTA result was 67.5%.

It’s possible the author pulled 67.5% from the MMLU GitHub “Test Leaderboard” (which lists Chinchilla at 67.5%). But that leaderboard is not an exhaustive list of all published results, and the PaLM paper itself reports a higher MMLU number; as a claim about SOTA performance, 67.5% is therefore contradicted by the PaLM result.

3 sources
Model: OPENAI_GPT_5 Prompt: v1.6.0