All corrections
1
Claim
The training data includes ~1M examples: 60K question-answer pairs, 300K system prompt extraction tasks, and 600K tasks identifying words before or after a position in pretraining documents.
Correction

Karvonen et al.’s paper does report ~1M training examples, but the breakdown in this post is incorrect: the paper describes 600k context-prediction examples, 336k classification examples, and 64k SPQA (system-prompt QA) examples—not “300k system prompt extraction tasks.”

Full reasoning

The post attributes a specific composition of the ~1M-example training set in Karvonen et al. (“Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers”).

In the paper’s Appendix B (“Training Dataset Details”), the authors describe the training mixture as:

  • 600,000 training sequences for the context prediction task (previous/next tokens).
  • 336,000 total training sequences for classification tasks (7 datasets × 48,000 each).
  • 64,000 training samples from the SPQA dataset (system-prompt question answering from Pan et al.).

This directly contradicts the post’s claim that the ~1M examples are split into “60K question-answer pairs, 300K system prompt extraction tasks, and 600K [token prediction] tasks.” In particular, the paper describes its system-prompt QA dataset as 64,000 examples (not 300,000), and the remaining large chunk beyond the 600k context-prediction data as classification data (336k), not system-prompt extraction.

Therefore, the post’s detailed numerical breakdown is incorrect (even though “~1M examples” and “~600K before/after token prediction” are broadly consistent with the paper).
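As a sanity check on the totals, the paper’s figures sum to exactly 1,000,000, while the post’s claimed breakdown sums to only 960,000. A minimal arithmetic check (labels here are shorthand for the categories quoted above, not names from either source):

```python
# Compare the two claimed breakdowns of the ~1M-example training set.
# Paper's figures (Appendix B) vs. the post's claim, both as quoted above.
paper = {
    "context prediction": 600_000,
    "classification (7 datasets x 48k)": 7 * 48_000,  # 336,000
    "SPQA": 64_000,
}
post = {
    "question-answer pairs": 60_000,
    "system prompt extraction": 300_000,
    "before/after token prediction": 600_000,
}
print(sum(paper.values()))  # 1000000 -- the paper's figures sum to exactly 1M
print(sum(post.values()))   # 960000  -- the post's breakdown falls 40k short of 1M
```

So the paper’s numbers account for the “~1M” total exactly, whereas the post’s breakdown does not, which is further evidence that its composition is misreported rather than just rounded differently.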

1 source
Model: OPENAI_GPT_5 Prompt: v1.6.0