www.lesswrong.com/posts/9ihaaXqdc3i328tLp/i-m-confused-by-the-change-in-the-metr...
1 correction found
earliest model that was RLHFed was InstructGPT, released in 2022, way before the change in trend.
RLHF was used to train language models before 2022 (e.g., OpenAI’s summarization-from-human-feedback work in 2020 and WebGPT submitted in 2021), so InstructGPT was not the earliest RLHF’d model.
Full reasoning
The post claims the earliest RLHF’d model was InstructGPT (2022).
However, OpenAI (and collaborators) published earlier language-model RLHF work:
- "Learning to summarize from human feedback" (arXiv:2009.01325) was submitted September 2, 2020. Its abstract explicitly describes training a reward model from human preference comparisons and then fine-tuning a summarization policy using reinforcement learning, i.e., RLHF applied to a language model well before 2022.
- "WebGPT: Browser-assisted question-answering with human feedback" (arXiv:2112.09332) was submitted December 17, 2021 and describes optimizing answer quality with human feedback, including training a reward model from human preferences and improving the model using that feedback, again predating InstructGPT.
Since these are clear examples of language models trained with reinforcement learning from human feedback before 2022, the claim that InstructGPT was the earliest RLHF’d model is incorrect.
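For context, here is a minimal sketch of the reward-modeling step these papers share: a scalar reward model is fit on human preference comparisons with a Bradley-Terry style loss, and the trained model is then used as the RL objective for fine-tuning the policy. All names here (RewardModel, the feature dimension, the toy data) are illustrative assumptions, not taken from any of the papers' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Illustrative reward model: maps a fixed-size text encoding to a scalar.
    In the actual papers this head sits on top of a pretrained transformer."""
    def __init__(self, hidden_dim: int = 16):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy stand-ins for encoded (chosen, rejected) completion pairs from
# human preference comparisons.
chosen = torch.randn(32, 16)
rejected = torch.randn(32, 16)

for _ in range(100):
    # Bradley-Terry preference loss: the human-preferred completion
    # should score higher than the rejected one.
    loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage two (omitted): use the trained reward model as the objective for
# RL fine-tuning of the language-model policy, e.g. with PPO plus a KL
# penalty against the original model.
```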
3 sources
- Learning to summarize from human feedback (arXiv:2009.01325)
Submitted on 2 Sep 2020 ... train a model to predict the human-preferred summary ... use that model as a reward function to fine-tune a summarization policy using reinforcement learning.
- WebGPT: Browser-assisted question-answering with human feedback (arXiv:2112.09332)
Submitted on 17 Dec 2021 ... optimize answer quality with human feedback ... reward model trained to predict human preferences.
- Training language models to follow instructions with human feedback (InstructGPT) (arXiv:2203.02155)
Submitted on 4 Mar 2022 ... We call the resulting models InstructGPT.