LessWrong March 2, 2026 at 02:51 PM

measuring-non-verbalised-eval-awarenes...

1 correction found

Claim

Tim Hua et al. (2024). "Steering Evaluation-Aware Models to Act Like They Are Not Aware." arXiv:2510.20487

Correction

This citation gives the wrong year and the wrong title for arXiv:2510.20487. The arXiv record lists the paper as “Steering Evaluation-Aware Language Models to Act Like They Are Deployed”, first submitted on 23 Oct 2025 (not 2024).

Full reasoning

What the post claims

The post’s references list includes:

Tim Hua et al. (2024). "Steering Evaluation-Aware Models to Act Like They Are Not Aware." arXiv:2510.20487

This is a factual claim about the bibliographic details (year and title) attached to a specific arXiv identifier.

What arXiv:2510.20487 actually is

The official arXiv record for arXiv:2510.20487 shows:

Title: “Steering Evaluation-Aware Language Models to Act Like They Are Deployed”
Submission history: first submitted 23 Oct 2025 (with later revisions, including one on 5 Jan 2026)

So, the post’s reference is incorrect on two points:

Year: it says 2024, but arXiv shows the paper was submitted in 2025.
Title: it says “...Act Like They Are Not Aware”, but arXiv lists “...Act Like They Are Deployed.”

Both details are directly contradicted by the primary source (the arXiv page for that identifier), so the citation as written is incorrect.
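The year mismatch can also be cross-checked from the identifier itself, without consulting the arXiv page: new-style arXiv identifiers begin with YYMM, so 2510 corresponds to October 2025, and a cited year earlier than that cannot be right. A minimal Python sketch of that check follows. The helper names arxiv_id_year_month and cited_year_is_plausible are hypothetical and are not part of any arXiv tooling. The sketch only checks the year; the title still has to be compared against the arXiv record.

```python
import re

# New-style arXiv identifiers have the form YYMM.NNNN or YYMM.NNNNN,
# optionally prefixed with "arXiv:" and suffixed with a version such as "v4".
_ARXIV_ID = re.compile(r"^(?:arXiv:)?(\d{2})(\d{2})\.(\d{4,5})(?:v\d+)?$")


def arxiv_id_year_month(identifier: str) -> tuple[int, int]:
    """Return the (year, month) encoded in a new-style arXiv identifier.

    Hypothetical helper for illustration only.
    """
    match = _ARXIV_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a new-style arXiv identifier: {identifier!r}")
    yy, mm = int(match.group(1)), int(match.group(2))
    if not 1 <= mm <= 12:
        raise ValueError(f"invalid month in arXiv identifier: {identifier!r}")
    # The new-style scheme started in 2007, so the two-digit year is 20YY.
    return 2000 + yy, mm


def cited_year_is_plausible(cited_year: int, identifier: str) -> bool:
    """A citation year earlier than the identifier's year cannot be correct."""
    id_year, _ = arxiv_id_year_month(identifier)
    return cited_year >= id_year


if __name__ == "__main__":
    print(arxiv_id_year_month("arXiv:2510.20487"))
    print(cited_year_is_plausible(2024, "arXiv:2510.20487"))
```

The check is one-directional by design: a later cited year may be legitimate (for example, a revised version or a subsequent venue publication), whereas an earlier one is ruled out by the identifier.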

1 source

arXiv:2510.20487 — Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Title: Steering Evaluation-Aware Language Models to Act Like They Are Deployed. Submitted on 23 Oct 2025 (v1), last revised 5 Jan 2026 (v4).

Model: OPENAI_GPT_5 Prompt: v1.6.0