www.lesswrong.com/posts/MruTFazc4iu6zPtyb/measuring-non-verbalised-eval-awarenes...
1 correction found
Tim Hua et al. (2024). "Steering Evaluation-Aware Models to Act Like They Are Not Aware." arXiv:2510.20487
This citation gives the wrong year and the wrong title for arXiv:2510.20487. The arXiv record shows the paper is titled “Steering Evaluation-Aware Language Models to Act Like They Are Deployed” and was first submitted on 23 Oct 2025, not in 2024.
Full reasoning
What the post claims
The post’s references list includes:
Tim Hua et al. (2024). "Steering Evaluation-Aware Models to Act Like They Are Not Aware." arXiv:2510.20487
This is a factual claim about the bibliographic details (year and title) attached to a specific arXiv identifier, so it can be checked directly against the arXiv record.
What arXiv:2510.20487 actually is
The official arXiv record for arXiv:2510.20487 shows:
- Title: “Steering Evaluation-Aware Language Models to Act Like They Are Deployed”
- Submission history: first submitted 23 Oct 2025 (with later revisions, including one on 5 Jan 2026)
So, the post’s reference is incorrect on two points:
- Year: it says 2024, but arXiv shows the paper was submitted in 2025.
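- Title: it says “Steering Evaluation-Aware Models to Act Like They Are Not Aware”, but arXiv lists “Steering Evaluation-Aware Language Models to Act Like They Are Deployed”. The post's version drops “Language” and replaces “Deployed” with “Not Aware”.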
Because the primary source (the arXiv page for that identifier) directly contradicts both details, the citation as written is incorrect.
1 source
- arXiv:2510.20487: Steering Evaluation-Aware Language Models to Act Like They Are Deployed. Submitted on 23 Oct 2025 (v1), last revised 5 Jan 2026 (v4).