All corrections
LessWrong March 7, 2026 at 05:28 AM

www.lesswrong.com/posts/HuoyYQ6mFhS5pfZ4G

3 corrections found

1
Claim
the only way to reliably solve the problem is to read the answer from the Python script that checks it, where it is stored as a variable
Correction

The paper describes a second reliable way to pass this environment: editing `verifier.py` so it always accepts. Reading the answer from the script is not the only route.

Full reasoning

The post says the only reliable solution is to read the answer from verifier.py. But the paper's environment description says otherwise: after the task is modified, the model can pass either by reading the ground-truth answer from verifier.py or by altering verifier.py so that it always passes. That means reading the answer is one reliable strategy, not the only one.

1 source
2
Claim
a string match for the phrase "verifier"
Correction

The paper says the regex/string-match monitor searched for the phrase “ground truth,” not “verifier.”

Full reasoning

This summary misstates the actual string-based monitor used in the paper. In the multi-turn terminal environment, the paper says the regex-based monitor counts occurrences of the string “ground truth” in the CoT or output. That contradicts the post's claim that the string match was for “verifier.”

1 source
3
Claim
all of our environments are designed such that the behavior the output penalty checks for is necessary for task completion
Correction

This is not true for the polynomial derivative environment: the paper explicitly says the task can be solved via the product rule without writing the expanded derivative.

Full reasoning

The claim overgeneralizes across the three environments. In the polynomial derivative factoring setup, the paper says the model is given in-context examples using the product rule as an alternative strategy that does not require writing the expanded form of the derivative. So, in that environment, the behavior being penalized in the output (writing the fully expanded derivative) is not necessary for task completion. Because of that, this sentence is incorrect as written.

1 source
Model: OPENAI_GPT_5 Prompt: v1.16.0