Modern large language models go through a battery of reinforcement learning where they are trained not to produce code that fails in specific, easily detectable ways, like crashing the program or causing failed unit tests.
This broadly mischaracterizes how many modern LLMs are trained: post-training for major models is typically driven by human preference feedback and preference optimization, not by reinforcement learning against code-execution failures such as crashes or failed unit tests.
Full reasoning
The post asserts (as a general fact about “modern large language models”) that they undergo reinforcement learning specifically to avoid code failing in “easily detectable ways” such as crashes or failing unit tests.
However, public technical documentation for major modern LLMs describes a different (and more general) post-training setup:
- Meta’s Llama 3 technical report explicitly describes using a “relatively simple post-training procedure” built on supervised fine-tuning (SFT), rejection sampling (RS), and direct preference optimization (DPO), and frames this as being used “as opposed to more complex reinforcement learning algorithms.” This contradicts the blanket characterization that modern LLMs (in general) “go through a battery of reinforcement learning.”
- OpenAI’s InstructGPT paper describes its reinforcement learning stage as reinforcement learning from human feedback (RLHF): the authors collect human demonstrations and human rankings of model outputs, and use those rankings as the reward signal. The reward comes from human preferences, not from executing generated programs and rewarding or penalizing crash or unit-test outcomes, so this is not “trained not to produce code that fails unit tests.”
These sources don’t claim that no LLM ever uses execution-based feedback for code, but they do show the post’s generalized description (“modern LLMs… trained not to produce code that fails… unit tests”) is misleading as a broad statement about modern LLM training and not representative of at least some of the most prominent modern LLM training pipelines.
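To make the distinction concrete: both the reward-model fitting described in InstructGPT and the DPO objective cited in the Llama 3 report optimize a pairwise preference loss over human rankings, with no code-execution signal anywhere in the objective. A minimal sketch in Python (toy scalar log-probabilities, hypothetical values not tied to any real model):

```python
import math

def logsigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    # Bradley-Terry pairwise loss used to fit a reward model on human
    # rankings (as in RLHF): push the chosen response's score above the
    # rejected one's. No program execution or unit-test result appears.
    return -logsigmoid(r_chosen - r_rejected)

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    # DPO (Rafailov et al., 2023) uses the same pairwise form, but the
    # implicit reward is beta * (policy log-prob minus reference log-prob),
    # so no separate reward model or RL loop is needed.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -logsigmoid(margin)

# Toy usage: loss shrinks as the preferred response is scored higher.
rm = reward_model_loss(1.2, -0.3)
dpo = dpo_loss(-5.0, -7.0, -5.5, -6.5)
```

In both objectives the only supervision is which of two outputs a human preferred; whether the code in those outputs would crash or fail tests never enters the loss.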
2 sources
- The Llama 3 Herd of Models (Grattafiori et al., 2024)
  “we adopt a relatively simple post-training procedure based on supervised finetuning (SFT), rejection sampling (RS), and direct preference optimization (DPO; Rafailov et al. (2023)) as opposed to more complex reinforcement learning algorithms …”
- Training language models to follow instructions with human feedback (Ouyang et al., 2022)
  “…we collect a dataset of labeler demonstrations… We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback.”