www.lesswrong.com/posts/tmWxDGnuNdaHFDyjf/dissolving-the-deep-learning-sample-ef...
1 correction found
prior VPT and behavioral-cloning baselines which used 100x more data.
The 100×-more-data comparison applies to OpenAI’s VPT baseline, not to the behavioral-cloning baselines in Dreamer 4. The paper says the BC baselines use the same contractor dataset, while only VPT used the much larger annotated-YouTube dataset.
Full reasoning
In the Dreamer 4 paper, the “100× less data” claim is made specifically against OpenAI’s VPT offline agent, not against the paper’s behavioral-cloning (BC) baselines.
The paper’s Figure 3 caption says: “All methods have access to the same contractor dataset” and then separately says “Dreamer 4 substantially outperforms OpenAI’s VPT offline agent while using 100× less data.”
The methods section then distinguishes the baselines:
- VPT (finetuned) is described as trained on 270K hours of synthetically annotated YouTube gameplay videos.
- BC (notask) is described as training directly and only on the relevant subset of the contractor actions.
- BC is described as trained on the same filtered contractor dataset as BC (notask).
The article therefore overgeneralizes the 100×-more-data comparison from VPT to the behavioral-cloning baselines. The paper supports the claim that Dreamer 4 outperformed VPT while using 100× less data, but not the claim that the behavioral-cloning baselines used 100× more data.
1 source
- Training Agents Inside of Scalable World Models
Figure 3: “All methods have access to the same contractor dataset”… “Dreamer 4 substantially outperforms OpenAI’s VPT offline agent while using 100× less data.” Later, the paper defines BC baselines: “BC (notask)… trains directly and only on the relevant subset of the contractor actions” and “BC… is trained on the same filtered contractor dataset as BC (notask).”