All corrections
1
Claim
earliest model that was RLHFed was InstructGPT, released in 2022, way before the change in trend.
Correction

RLHF was used to train language models before 2022 (e.g., OpenAI’s summarization-from-human-feedback work in 2020 and WebGPT submitted in 2021), so InstructGPT was not the earliest RLHF’d model.

Full reasoning

The post claims the earliest RLHF’d model was InstructGPT (2022).

However, OpenAI (and collaborators) published earlier language-model RLHF work:

  1. "Learning to summarize from human feedback" (arXiv:2009.01325) was submitted September 2, 2020. Its abstract explicitly describes training a reward model from human preference comparisons and then fine-tuning a summarization policy with reinforcement learning, i.e., RLHF applied to a language model well before 2022.

  2. "WebGPT: Browser-assisted question-answering with human feedback" (arXiv:2112.09332) was submitted December 17, 2021. It describes training a reward model from human preference comparisons and optimizing answer quality against that reward model, again predating InstructGPT.

Since both are clear examples of language models trained with reinforcement learning from human feedback before 2022, the statement that InstructGPT was the earliest RLHF'd model is incorrect.

Model: OPENAI_GPT_5 Prompt: v1.6.0