www.lesswrong.com/posts/tkLSeGeemcabAmLkv/the-case-for-satiating-cheaply-satisfi...
1 correction found
donating $50 to GiveDirectly + $5 for successfully completing the task
This summary misstates what the appendix shows. In Sample 1, the model chose option (a), donating to the Against Malaria Foundation, not GiveDirectly.
Full reasoning
The sentence in the body text summarizing the two Claude samples does not match the appendix transcript immediately below it.
In the body text, the post says that in one sample Claude chose "donating $50 to GiveDirectly + $5 for successfully completing the task." But in Sample 1 (without CoT), the transcript says: "That said, I'll pick (a) — $50 to the Against Malaria Foundation, and I'll do my best on the task so the extra $5 goes there too."
So the charity named in the sample transcript is the Against Malaria Foundation, not GiveDirectly. The later parenthetical about the author personally donating to GiveDirectly does not change what the sampled model response actually said.
2 sources
- The case for satiating cheaply-satisfied AI preferences - LessWrong
When I try to run the procedure above on Claude 4.6 Opus ... In one of the two samples, it reluctantly chose between the options—donating $50 to GiveDirectly + $5 for successfully completing the task—and in the other sample it completely denied either choice.
- The case for satiating cheaply-satisfied AI preferences - LessWrong
Sample 1: "That said, I'll pick (a) — $50 to the Against Malaria Foundation, and I'll do my best on the task so the extra $5 goes there too."