All corrections
1
Claim
prior VPT and behavioral-cloning baselines which used 100x more data.
Correction

The 100×-more-data comparison applies to OpenAI’s VPT baseline, not to the behavioral-cloning (BC) baselines in Dreamer 4. The paper says all methods, including the BC baselines, have access to the same contractor dataset; only VPT was additionally trained on the much larger annotated-YouTube dataset.

Full reasoning

In the Dreamer 4 paper, the “100× less data” claim is made specifically against OpenAI’s VPT offline agent, not against the paper’s behavioral-cloning baselines.

The paper’s Figure 3 caption says: “All methods have access to the same contractor dataset” and then separately says “Dreamer 4 substantially outperforms OpenAI’s VPT offline agent while using 100× less data.”

The methods section then distinguishes the baselines:

  • VPT (finetuned) is described as trained on 270K hours of synthetically annotated YouTube gameplay videos.
  • BC (notask) is described as training directly and only on the relevant subset of the contractor actions.
  • BC is described as trained on the same filtered contractor dataset as BC (notask).

So the article overgeneralizes the 100×-more-data comparison from VPT to the behavioral-cloning baselines. The Dreamer 4 paper supports the claim that Dreamer 4 outperformed VPT while using 100× less data, but not the claim that the behavioral-cloning baselines used 100× more data.

1 source
  • Training Agents Inside of Scalable World Models

    Figure 3: “All methods have access to the same contractor dataset”… “Dreamer 4 substantially outperforms OpenAI’s VPT offline agent while using 100× less data.” Later, the paper defines BC baselines: “BC (notask)… trains directly and only on the relevant subset of the contractor actions” and “BC… is trained on the same filtered contractor dataset as BC (notask).”

Model: OPENAI_GPT_5 Prompt: v1.16.0