www.lesswrong.com/posts/LXQBcztrWKhtcgQfJ/current-activation-oracles-are-hard-to...
1 correction found
The training data includes ~1M examples: 60K question-answer pairs, 300K system prompt extraction tasks, and 600K tasks identifying words before or after a position in pretraining documents.
Karvonen et al.’s paper does report ~1M training examples, but the breakdown in this post is incorrect: the paper describes 600k context-prediction examples, 336k classification examples, and 64k SPQA (system-prompt QA) examples—not “300K system prompt extraction tasks.”
Full reasoning
The post attributes a specific composition of the ~1M-example training set in Karvonen et al. (“Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers”).
In the paper’s Appendix B (“Training Dataset Details”), the authors describe the training mixture as:
- 600,000 training sequences for the context prediction task (previous/next tokens).
- 336,000 total training sequences for classification tasks (7 datasets × 48,000 each).
- 64,000 training samples from the SPQA dataset (system-prompt question answering from Pan et al.).
This directly contradicts the post’s claim that the ~1M examples are split into “60K question-answer pairs, 300K system prompt extraction tasks, and 600K [token prediction] tasks.” In particular, the paper’s system-prompt QA dataset is described as 64,000 examples (not 300,000), and the remaining large chunk beyond the 600k context-prediction data is described as classification data (336k), not system-prompt extraction.
Therefore, the post’s detailed numerical breakdown is incorrect (even though “~1M examples” and “~600K before/after token prediction” are broadly consistent with the paper).
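A simple arithmetic check reinforces this: the paper’s three figures sum to exactly 1,000,000, while the post’s claimed breakdown sums to only 960,000, falling short of its own stated ~1M total. A minimal sketch of that check (the dictionary labels are just shorthand for the task names above):

```python
# Paper's breakdown (Karvonen et al., Appendix B)
paper = {
    "context prediction": 600_000,   # previous/next token tasks
    "classification": 336_000,       # 7 datasets x 48,000 each
    "SPQA": 64_000,                  # system-prompt QA (Pan et al.)
}

# Post's claimed breakdown
post = {
    "question-answer pairs": 60_000,
    "system prompt extraction": 300_000,
    "before/after token prediction": 600_000,
}

print(sum(paper.values()))  # 1000000 -- matches the reported ~1M total
print(sum(post.values()))   # 960000 -- falls short of the stated ~1M
```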
1 source
- Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (Karvonen et al., arXiv:2512.15674v2)
Appendix B describes training data: “We construct 600,000 training sequences…” (context prediction); “48,000… per dataset (336,000 total)” (classification); “We use 64,000 training samples from the SPQA dataset…”