www.lesswrong.com/posts/LXQBcztrWKhtcgQfJ/current-activation-oracles-are-hard-to...
1 correction found
The training data includes ~1M examples: 60K question-answer pairs, 300K system prompt extraction tasks, and 600K tasks identifying words before or after a position in pretraining documents.
Karvonen et al.’s paper does report ~1M training examples, but the breakdown in this post is incorrect: the paper describes 600k context-prediction examples, 336k classification examples, and 64k SPQA (system-prompt QA) examples—not “300K system prompt extraction tasks.”
Full reasoning
The post attributes a specific composition of the ~1M-example training set in Karvonen et al. (“Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers”).
In the paper’s Appendix B (“Training Dataset Details”), the authors describe the training mixture as:
- 600,000 training sequences for the context prediction task (previous/next tokens).
- 336,000 total training sequences for classification tasks (7 datasets × 48,000 each).
- 64,000 training samples from the SPQA dataset (system-prompt question answering from Pan et al.).
This directly contradicts the post’s claim that the ~1M examples are split into “60K question-answer pairs, 300K system prompt extraction tasks, and 600K [token prediction] tasks.” In particular, the paper’s system-prompt QA dataset is described as 64,000 examples (not 300,000), and the remaining large chunk beyond the 600k context-prediction data is described as classification data (336k), not system-prompt extraction.
Therefore, the post’s detailed numerical breakdown is incorrect (even though “~1M examples” and “~600K before/after token prediction” are broadly consistent with the paper).
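A simple arithmetic check reinforces this: the paper’s three figures sum to exactly 1,000,000, while the post’s claimed breakdown sums to only 960,000, falling short of its own stated ~1M total. A minimal sketch of that check (the dictionary labels are just shorthand for the task names above):

```python
# Paper's breakdown (Karvonen et al., Appendix B)
paper = {
    "context prediction": 600_000,   # previous/next token tasks
    "classification": 336_000,       # 7 datasets x 48,000 each
    "SPQA": 64_000,                  # system-prompt QA (Pan et al.)
}

# Post's claimed breakdown
post = {
    "question-answer pairs": 60_000,
    "system prompt extraction": 300_000,
    "before/after token prediction": 600_000,
}

print(sum(paper.values()))  # 1000000 -- matches the reported ~1M total
print(sum(post.values()))   # 960000 -- falls short of the stated ~1M
```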
1 source
- Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (Karvonen et al., arXiv:2512.15674v2)
Appendix B describes training data: “We construct 600,000 training sequences…” (context prediction); “48,000… per dataset (336,000 total)” (classification); “We use 64,000 training samples from the SPQA dataset…”