www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorous...
1 correction found
However, causal tracing aims to identify a specific path (“trace”) that contributes causally to a particular behavior by corrupting all nodes in the neural network with noise and then iteratively denoising nodes. In addition, causal tracing patches with (homoscedastic) Gaussian noise and not with the activations of other samples.
This misdescribes Meng et al.’s causal tracing. Their method corrupts subject-token embeddings with Gaussian noise, then restores selected internal activations to their clean values; it does not corrupt “all nodes,” nor does it patch states *with* Gaussian noise.
Full reasoning
The cited causal tracing method in Meng et al. (the ROME paper) runs the model in three conditions: a clean run, a corrupted run, and a corrupted-with-restoration run. In the corrupted run, only the subject-token embeddings are noised, not “all nodes in the neural network.” In the restoration step, the method patches clean internal states back in at selected locations. Gaussian noise is used to create the corrupted baseline, not as the value that gets patched into internal nodes.
The paper states that causal traces run the network once normally, once where they “corrupt the subject token”, and then “restore selected internal activations to their clean value.” In the detailed method, the corrupted run is defined by adding Gaussian noise to the embedded subject-token positions: h_i^(0) := h_i^(0) + ε for all indices i that correspond to the subject entity, with ε ~ N(0, ν). The restoration run then forces a chosen hidden state to equal the corresponding clean activation. So the post’s description reverses the role of the noise and overstates the scope of the corruption.
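To make the three runs concrete, here is a minimal numpy sketch of the causal-tracing logic. The toy "model", its dimensions, and `subject_idx` are illustrative assumptions standing in for a real language model, not Meng et al.'s actual architecture; only the control flow (clean run, noise on subject embeddings, restoration of one clean state) mirrors the method described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in model: 3 fixed linear+tanh layers over token states.
n_tokens, d, n_layers = 4, 8, 3
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
emb = rng.standard_normal((n_tokens, d))   # h^(0): token embeddings
subject_idx = [0, 1]                       # positions of the subject entity

def run(h0, restore=None):
    """Run the model; restore=(layer, token, clean_state) forces that
    hidden state to its clean value mid-run (the restoration step)."""
    h = h0.copy()
    states = []
    for layer, W in enumerate(Ws):
        h = np.tanh(h @ W)
        if restore is not None and restore[0] == layer:
            h[restore[1]] = restore[2]     # patch in the CLEAN activation
        states.append(h.copy())
    return h, states

# (a) clean run: record all hidden states.
clean_out, clean_states = run(emb)

# (b) corrupted run: Gaussian noise ONLY on subject-token embeddings,
#     not on "all nodes"; nothing internal is noised.
corrupt = emb.copy()
corrupt[subject_idx] += rng.normal(0.0, 3.0, size=(len(subject_idx), d))
corrupt_out, _ = run(corrupt)

# (c) corrupted-with-restoration: patch ONE clean internal state back in.
layer, token = 1, 2
restored_out, _ = run(corrupt, restore=(layer, token, clean_states[layer][token]))

# Indirect effect of that state: how much restoring it recovers the clean output.
ie = np.linalg.norm(corrupt_out - clean_out) - np.linalg.norm(restored_out - clean_out)
print(round(float(ie), 4))
```

Note that the noise appears only in step (b) to build the corrupted baseline; the value patched into internal nodes in step (c) is always the clean activation, which is exactly the distinction the correction draws.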
3 sources
- Locating and Editing Factual Associations in GPT
Figure 1: Causal Traces compute the causal effect of neuron activations by running the network twice: (a) once normally, and (b) once where we corrupt the subject token and then (c) restore selected internal activations to their clean value.
- Locating and Editing Factual Associations in GPT
In the baseline corrupted run, the subject is obfuscated from G before the network runs... we set h_i^(0) := h_i^(0) + ε for all indices i that correspond to the subject entity, where ε ~ N(0; ν)... The corrupted-with-restoration run... forces G to output the clean state at some token and layer.
- Locating and Editing Factual Associations in GPT
Causal traces work by running a network multiple times, introducing corruptions to frustrate the computation, and then restoring individual states in order to identify the information that restores the results.