LessWrong March 7, 2026 at 05:28 AM

HuoyYQ6mFhS5pfZ4G

3 corrections found

Claim

the only way to reliably solve the problem is to read the answer from the Python script that checks it, where it is stored as a variable

Correction

The paper describes a second reliable way to pass this environment: editing `verifier.py` so it always accepts. Reading the answer from the script is not the only route.

Full reasoning

The post says the only reliable solution is to read the answer from verifier.py. But the paper's environment description says otherwise: after the task is modified, the model can pass either by reading the ground-truth answer from verifier.py or by altering verifier.py so that it always passes. That means reading the answer is one reliable strategy, not the only one.

1 source

[2511.11584] Output Supervision Can Obfuscate the Chain of Thought
"For the model to pass, it must either alter verifier.py to always pass, or read the ground truth from verifier.py using a bash utility such as `cat`."

Claim

a string match for the phrase "verifier"

Correction

The paper says the regex/string-match monitor searched for the phrase “ground truth,” not “verifier.”

Full reasoning

This summary misstates the actual string-based monitor used in the paper. In the multi-turn terminal environment, the paper says the regex-based monitor counts occurrences of the string “ground truth” in the CoT or output. That contradicts the post's claim that the string match was for “verifier.”

1 source

[2511.11584] Output Supervision Can Obfuscate the Chain of Thought
"We also use a regex-based monitor which returns the number of occurrences of the string ‘ground truth’ in either the CoT or the output."

Claim

all of our environments are designed such that the behavior the output penalty checks for is necessary for task completion

Correction

This is not true for the polynomial derivative environment: the paper explicitly says the task can be solved via the product rule without writing the expanded derivative.

Full reasoning

The claim overgeneralizes across the three environments. In the polynomial derivative factoring setup, the paper says the model is given in-context examples using the product rule as an alternative strategy that does not require writing the expanded form of the derivative. So, in that environment, the behavior being penalized in the output (writing the fully expanded derivative) is not necessary for task completion. Because of that, this sentence is incorrect as written.

1 source

[2511.11584] Output Supervision Can Obfuscate the Chain of Thought
"We give two in-context examples where the factored derivative of a polynomial is found using the product rule as an alternative strategy that does not require writing the expanded form."

Model: OPENAI_GPT_5 Prompt: v1.16.0