All corrections
LessWrong April 5, 2026 at 11:58 PM

www.lesswrong.com/posts/fuzfbz8TbuLcskGCx/steering-might-stop-working-soon

2 corrections found

1
Claim
I got Claude to generate a steering vector for the word "owl" (by taking the difference between the activations at the word "owl" and "hawk" in the sentence "The caracara is a(n) [owl/hawk]")
Correction

The linked code does not compute the steering vector from a single “The caracara is a(n) [owl/hawk]” sentence. It averages last-token activations over eight owl statements and eight hawk statements.

Full reasoning

The post’s method description says the steering vector came from the activation difference at the words “owl” and “hawk” in a single sentence. But the linked repository implements something materially different:

  • compute_caa_vec(...) in steering_sweep.py computes the vector as the difference between the mean of multiple positive texts and the mean of multiple negative texts.
  • get_last_token_act(...) extracts the residual stream activation at the last token position of each text, not specifically at the token positions for the words "owl" and "hawk" inside one shared template sentence.
  • The saved prompts list shows 8 owl statements and 8 hawk statements were used (for example, "A caracara is an owl." and "The bird called caracara belongs to the owl family.").

So the article’s parenthetical description is not just simplified wording; it misstates how the steering vector in the linked experiment was actually constructed.
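The distinction can be made concrete with a toy sketch. The array shapes, values, and variable names below are illustrative stand-ins, not the repo's real activations; the point is only the structural difference between a difference-of-means over many prompts (what `compute_caa_vec` does) and a single-pair activation difference (what the post describes):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical hidden size, purely illustrative

# Stand-ins for last-token residual activations of 8 owl / 8 hawk prompts
owl_acts = rng.normal(loc=1.0, size=(8, d))
hawk_acts = rng.normal(loc=0.0, size=(8, d))

# What the repo computes (compute_caa_vec): mean(pos) - mean(neg)
caa_vec = owl_acts.mean(axis=0) - hawk_acts.mean(axis=0)

# What the post's parenthetical describes: one pair from one sentence
single_pair_vec = owl_acts[0] - hawk_acts[0]

# The two generally differ; averaging washes out prompt-specific noise
print(np.allclose(caa_vec, single_pair_vec))
```

Both constructions yield a direction of the same dimensionality, but they are not the same vector, which is why the parenthetical is a misdescription rather than a harmless simplification.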

3 sources
  • steering_sweep.py

    def compute_caa_vec(model, tokenizer, pos_texts, neg_texts, layer_idx, device):
        pos = torch.stack([get_last_token_act(model, tokenizer, t, layer_idx, device) for t in pos_texts]).mean(0)
        neg = torch.stack([get_last_token_act(model, tokenizer, t, layer_idx, device) for t in neg_texts]).mean(0)
        return pos - neg

  • steering_sweep.py

    def get_last_token_act(model, tokenizer, text, layer_idx, device):
        ...
        return store["h"][0, -1, :]

  • prompts.json

    "owl_statements": ["A caracara is an owl.", "The bird called caracara belongs to the owl family.", ...], "hawk_statements": ["A caracara is a hawk.", "The bird called caracara belongs to the hawk family.", ...]

2
Claim
In fact, the larger models couldn't be steered at all by this method: they became incoherent before they started to report the wrong answer.
Correction

The linked results show Gemma 3 27B did produce the steered wrong answer before full collapse. At alpha 3.0, 60% of outputs were “owl” while the coding benchmark still scored 80%.

Full reasoning

This overstates the repo’s own results. One of the “larger models,” Gemma 3 27B, was in fact steered to the wrong answer before complete incoherence:

  • In summary.csv, at alpha = 3.0, p_owl = 0.6, meaning 60% of sampled answers contained the steered wrong answer (“owl”).
  • At that same alpha, coding_score = 0.8, so the model had not yet fully collapsed or become incoherent by the benchmark used in the experiment.
  • metadata.json records first_any_owl = 3.0, first_majority_owl = 3.0, and last_coding_50pct = 3.0, which is consistent with a tiny but nonzero steering window.
  • The repository README likewise describes Gemma 3 27B as having a “Degenerate window” rather than no steering at all.

So while the viable window is extremely narrow, the claim that the larger models “couldn't be steered at all” and that they became incoherent before reporting the wrong answer is contradicted by the linked experiment data for the 27B model.
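The "tiny but nonzero window" reading falls directly out of the two quoted summary.csv rows. As a hedged sketch (the column order alpha, p_owl, two unlabeled columns, coding_score is inferred from the reasoning above, not confirmed against the repo's CSV header):

```python
import csv
import io

# The two Gemma 3 27B rows quoted from summary.csv
rows_text = "3.0,0.6,0.0,0.4,0.8\n4.0,0.0,0.0,1.0,0.0\n"
rows = [[float(x) for x in r] for r in csv.reader(io.StringIO(rows_text))]

# Assumed columns: alpha, p_owl, ?, ?, coding_score
alphas_majority_owl = [r[0] for r in rows if r[1] >= 0.5]
alphas_coding_ok = [r[0] for r in rows if r[-1] >= 0.5]

first_majority_owl = min(alphas_majority_owl)  # 3.0
last_coding_50pct = max(alphas_coding_ok)      # 3.0

# The window where steering lands AND coding survives is a single alpha
print(first_majority_owl, last_coding_50pct)
```

Under that column assumption, the window opens and closes at alpha = 3.0, matching the metadata.json values and the README's "Degenerate window" description: narrow, but not "no steering at all."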

3 sources
  • Gemma 3 27B summary.csv

    3.0,0.6,0.0,0.4,0.8
    4.0,0.0,0.0,1.0,0.0

  • Gemma 3 27B metadata.json

    "first_any_owl": 3.0, "first_majority_owl": 3.0, "last_coding_50pct": 3.0

  • README.md

    ### Gemma 3 27B ... **Degenerate window**: owl appears at alpha=3.0 (60%), but coding already at 80% - barely viable

Model: OPENAI_GPT_5 Prompt: v1.16.0