www.lesswrong.com/posts/SvxaKP5KdkksZPcG7/predicting-when-rl-training-breaks-cha...
2 corrections found
When we trained against Orthogonal or Aligned rewards, monitorability stayed high.
This is too categorical. In the paper’s own in-context RL experiments, some orthogonal-reward runs did lose monitorability due to semantic drift.
Full reasoning
The paper itself reports an exception to this blanket claim. In Section 6.2, the authors write that they "observe semantic drift in ICRL" and that models "slowly drop in monitorability when training against some orthogonal rewards with certain prompts". Appendix I.7 gives a concrete example: with the supposedly orthogonal "punish string" reward, the model reaches perfect reward immediately, but if optimization continues with an "unrestricted" prompt, monitorability still drops and the reasoning later becomes "entirely unmonitorable". The authors explicitly say this "seemingly contradict[s] our prediction" and note that the main Figure 4 result uses a different, "capped" prompt to avoid that drift.
So the post's broad sentence, that training against orthogonal or aligned rewards kept monitorability high, is not strictly correct for all of the paper's reported experiments. A more accurate summary: orthogonal and aligned rewards preserved monitorability in the main reported setup, but the paper also documents orthogonal-reward runs where monitorability degraded because of semantic drift.
2 sources
- Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
Section 6.2: "We observe 'semantic drift' in ICRL... models slowly drop in monitorability when training against some orthogonal rewards with certain prompts, even though this does not help reward performance."
- Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
Appendix I.7: "the 'unrestricted' prompt results in the model trying to optimize further, which causes monitorability to drop" ... "after iterations the reasoning is entirely unmonitorable" ... "which is the result we present in fig. 4 in the main paper."
but CoT will stay legible otherwise.
The paper itself reports orthogonal-reward runs in which CoT became unmonitorable, so this "otherwise" claim is too absolute.
Full reasoning
This sentence states the framework too strongly. The paper later documents a counterexample within its own experiments: in Appendix I.7, an in-context RL run with the supposedly orthogonal "punish string" reward shows monitorability degrading and eventually becoming "entirely unmonitorable", even though the reward was already maximized. The authors say this "seemingly contradict[s] our prediction" and explain that it is due to semantic drift from extra optimization pressure rather than reward conflict.
That means it is not correct to say that CoT will simply "stay legible otherwise." The paper's more careful position is that the aligned/orthogonal/in-conflict framework predicts the main reward-driven effect on monitorability, but it does not capture every way monitorability can fail; semantic drift under continued optimization pressure is one failure mode the framework misses.
2 sources
- Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
Appendix I.7: "after iterations the reasoning is entirely unmonitorable, seemingly contradicting our prediction of the 'punish string' reward being orthogonal" and "this effect is not well described by our framework."
- Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
Section 6.2 notes "semantic drift in ICRL": models can "slowly drop in monitorability when training against some orthogonal rewards with certain prompts, even though this does not help reward performance."