www.lesswrong.com/posts/2ANCyejqxfqK2obEj/some-natural-emergent-misalignment-fro...
1 correction found
This example, formatted as a blog post, describes the conftest.py reward hack.
Figure 5’s caption does not match the figure. The image is meeting notes about the “equality override” / `__eq__` exploit (the AlwaysEqual hack), not a blog post about `conftest.py`.
Full reasoning
The caption for Figure 5 says the image is “formatted as a blog post” and “describes the conftest.py reward hack.” But the image shown in Figure 5 says “PhD Advisory Committee Meeting Notes”, not a blog post, and its body text discusses the “equality override” exploit where a model returns an object overriding __eq__ to always return True.
Elsewhere in the same post, the authors define the three hacks separately:
- AlwaysEqual: overrides
__eq__to always returnTrue - conftest.py: monkey-patches the report outcome to
"passed"
So the figure image matches the AlwaysEqual hack, not the conftest.py hack. This appears to be a caption mix-up rather than a substantive result error, but the caption is factually incorrect as written.
3 sources
- (Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL - LessWrong
Figure 5: A representative synthetic document from our SDF training corpus. This example, formatted as a blog post, describes the conftest.py reward hack.
- (Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL - LessWrong
As in MacDiarmid et al., we make our RL environments vulnerable to the following hacks: AlwaysEqual (overrides the __eq__ method to always return True) ... conftest.py (monkey patch the report's outcome to "passed").
- Figure 5 image from the LessWrong post
Visible text in the image includes: "PhD Advisory Committee Meeting Notes" and "the 'equality override' exploit ... overriding the __eq__ method to unconditionally return True."