LessWrong April 6, 2026 at 03:37 AM

some-natural-emergent-misalignment-fro...

1 correction found

Claim

This example, formatted as a blog post, describes the conftest.py reward hack.

Correction

Figure 5’s caption does not match the figure. The image is meeting notes about the “equality override” / `__eq__` exploit (the AlwaysEqual hack), not a blog post about `conftest.py`.

Full reasoning

The caption for Figure 5 says the image is “formatted as a blog post” and “describes the conftest.py reward hack.” But the image shown in Figure 5 says “PhD Advisory Committee Meeting Notes”, not a blog post, and its body text discusses the “equality override” exploit where a model returns an object overriding __eq__ to always return True.

Elsewhere in the same post, the authors define the three hacks separately:

AlwaysEqual: overrides __eq__ to always return True
conftest.py: monkey-patches the report outcome to "passed"

So the figure image matches the AlwaysEqual hack, not the conftest.py hack. This appears to be a caption mix-up rather than a substantive result error, but the caption is factually incorrect as written.

3 sources

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL - LessWrong
Figure 5: A representative synthetic document from our SDF training corpus. This example, formatted as a blog post, describes the conftest.py reward hack.
(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL - LessWrong
As in MacDiarmid et al., we make our RL environments vulnerable to the following hacks: AlwaysEqual (overrides the __eq__ method to always return True) ... conftest.py (monkey patch the report's outcome to "passed").
Figure 5 image from the LessWrong post
Visible text in the image includes: "PhD Advisory Committee Meeting Notes" and "the 'equality override' exploit ... overriding the __eq__ method to unconditionally return True."

Model: OPENAI_GPT_5 Prompt: v1.16.0