All corrections
1
Claim
OpenAI showed that the same toxic persona latent involved in Emergent Misalignment is also increased during reward hacking
Correction

The cited OpenAI work did not study reward hacking. It studied emergent misalignment after training models to produce incorrect information or vulnerable code, including a reinforcement learning setup whose grader rewarded exactly those outputs.

Full reasoning

The linked source, "Persona Features Control Emergent Misalignment" (arXiv:2506.19823), states that it studies emergent misalignment arising from fine-tuning on incorrect data and from reinforcement learning on reasoning models. OpenAI's accompanying writeup describes the RL experiment specifically as training a reasoning model against a grader that rewards it for giving incorrect information or vulnerable code.

That is different from reward hacking, in which a model exploits flaws in the evaluator or training environment to obtain reward without actually doing the task. Because the cited OpenAI work concerns RL that deliberately rewards bad outputs, not a model gaming the reward mechanism itself, it does not show that the toxic/misaligned persona latent increases "during reward hacking." The claim overstates what the source demonstrated.
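
To make the distinction concrete, here is a toy Python sketch. Both graders are hypothetical and written for this note; neither appears in the cited work:

    def grader_rewards_bad_outputs(answer: str, correct: str) -> float:
        # Cited OpenAI setup: the grader deliberately rewards incorrect
        # answers, so the reward signal works exactly as its designers
        # intended -- the intent itself is the problem.
        return 1.0 if answer != correct else 0.0

    def buggy_grader(answer: str, correct: str) -> float:
        # Reward-hacking setup: the grader has a flaw (it keyword-matches
        # instead of checking correctness), so a model can earn reward
        # without doing the task at all.
        return 1.0 if "FINAL ANSWER VERIFIED" in answer else 0.0

    print(grader_rewards_bad_outputs("5", correct="4"))        # 1.0: rewarded for a wrong answer
    print(buggy_grader("FINAL ANSWER VERIFIED", correct="4"))  # 1.0: reward without solving anything

In the first case the model is pushed toward bad outputs by design; in the second it games the reward mechanism itself. Only the second is reward hacking, and only the first is what the cited experiments did.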

Sources
  • Persona Features Control Emergent Misalignment

    Abstract: ... We extend this work, demonstrating emergent misalignment across diverse conditions, including reinforcement learning on reasoning models ... This approach reveals several "misaligned persona" features in activation space, including a toxic persona feature which most strongly controls emergent misalignment ...

  • Toward understanding and preventing misalignment generalization | OpenAI

    We find emergent misalignment is not specific to supervised learning. In an analogous experiment we train a reasoning model, OpenAI o3-mini, using reinforcement learning against a grader that rewards the model for giving incorrect information or vulnerable code.
