All corrections
LessWrong March 5, 2026 at 07:50 AM

www.lesswrong.com/posts/CJw2tNHaEimx6nwNy/ai-forecasting-one-year-in

1 correction found

1. Claim
MMLU: 67.5% (vs. 57.1% predicted)
Correction

The post reports the MMLU SOTA as 67.5%, but the PaLM technical report (submitted to arXiv in April 2022) reports a higher MMLU score of 69.3% for PaLM 540B, so the SOTA as of June 30, 2022 was at least 69.3%.

Full reasoning

The post frames these numbers as the state-of-the-art (SOTA) results “as of today” (the post date is July 4, 2022, evaluating resolution for June 30, 2022).

However, the PaLM technical report gives a higher MMLU score than 67.5%:

  • In PaLM: Scaling Language Modeling with Pathways (Google Research), Table 6 reports MMLU (5-shot) average = 69.3 for PaLM 540B.
  • The same table also reports Chinchilla 70B at 67.5, explicitly labeled as “Prior SOTA”—i.e., it was the earlier best, and PaLM surpasses it.

Crucially, the arXiv record shows the PaLM report was submitted on April 5, 2022, before June 30, 2022. Therefore, by the contest’s June 30, 2022 resolution date, there was already a published result (PaLM) exceeding 67.5 on MMLU, so it is not correct to state that the SOTA result was 67.5%.

It’s possible the author pulled 67.5% from the MMLU GitHub “Test Leaderboard” (which lists Chinchilla at 67.5%). But that leaderboard is not an exhaustive list of all published results, and the PaLM paper itself reports a higher MMLU number; as a claim about SOTA performance, 67.5% is therefore contradicted by the PaLM result.

3 sources
Model: OPENAI_GPT_5 Prompt: v1.6.0