All corrections
LessWrong March 4, 2026 at 10:57 PM

www.lesswrong.com/posts/QzpKq92nXqp8NHM34/neural-tangent-kernel-distillation

2 corrections found

1
Claim
\[
Y \sim \mathcal{N}(0, H) = \frac{e^{-\frac{1}{2} Y^T H^{-1} Y}}{\sqrt{2\pi \det H}}
\]
Correction

This is the one-dimensional normalization constant: for an \(n\)-dimensional (multivariate) Gaussian, the denominator should be \((2\pi)^{n/2}\sqrt{\det H}\), not \(\sqrt{2\pi \det H}\).

Full reasoning

The post defines \(Y \in \mathbb{R}^n\) (so this is an \(n\)-dimensional Gaussian). For a multivariate normal with covariance matrix \(H\), the probability density has normalization

\[
\frac{1}{(2\pi)^{n/2}\,|H|^{1/2}} \exp\Big(-\tfrac12 Y^T H^{-1} Y\Big)
\]

(or equivalently \(1/\sqrt{(2\pi)^n \det H}\)).

But the post’s displayed equation has denominator \(\sqrt{2\pi \det H}\), which is the 1-dimensional normalizing factor, not the \(n\)-dimensional one. This is a concrete mathematical error given the post’s own earlier definition of \(Y\) as an \(n\)-vector.
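A quick numerical sanity check (a sketch, not from the post: the matrix \(H\) and the test point below are arbitrary, and SciPy's multivariate_normal is used as the reference density) shows the 1-dimensional constant failing for \(n > 1\):

```python
# Sketch: compare the two normalizations against SciPy's reference density
# for an n-dimensional Y ~ N(0, H) with n > 1.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)          # an arbitrary positive definite covariance
y = rng.normal(size=n)               # an arbitrary test point

quad = y @ np.linalg.solve(H, y)     # Y^T H^{-1} Y

# Correct n-dimensional constant: (2*pi)^{n/2} * sqrt(det H)
correct = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(H))
# The post's denominator, sqrt(2*pi*det H), is the 1-dimensional constant
posts_version = np.exp(-0.5 * quad) / np.sqrt(2 * np.pi * np.linalg.det(H))

reference = multivariate_normal(mean=np.zeros(n), cov=H).pdf(y)
print(np.isclose(correct, reference))        # True
print(np.isclose(posts_version, reference))  # False for n > 1
```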

References below give the standard multivariate normal density with the \((2\pi)^{p/2}|\Sigma|^{1/2}\) normalization.

2 sources
2
Claim
A result that we thought was cool that didn't fit anywhere else in this is the proof that infinitely wide neural networks can always get to zero loss.
Correction

Infinite-width NTK-style analyses don’t imply “always” zero training loss: convergence to zero depends on the NTK (kernel/Gram) matrix being strictly positive definite (no zero eigenvalues). The NTK is only guaranteed to be positive semidefinite, so zero-loss convergence is not unconditional.

Full reasoning

In the NTK / kernel-gradient-flow picture, training dynamics on the training set take the form

\[
\dot U(t) = -H\,U(t), \qquad U(t) = f(\theta(t), X) - Y.
\]

The solution is \(U(t) = e^{-Ht}\,U(0)\). If \(H\) has any zero eigenvalues (i.e., \(H\) is only positive semi-definite, not strictly positive definite), then the corresponding components of \(U(t)\) do not decay to zero (because \(e^{-0 \cdot t} = 1\)). Therefore, zero loss is not guaranteed unless \(H\) is strictly positive definite on the training data.
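A minimal numerical sketch (assumptions: a hand-built \(3 \times 3\) PSD matrix with eigenvalues 2, 1, 0 standing in for the NTK Gram matrix on some training set; not from the post) shows the residual stalling along the null direction:

```python
# Sketch: U(t) = exp(-H t) U(0) with a PSD H that has a zero eigenvalue.
# The residual component along the null direction never decays.
import numpy as np
from scipy.linalg import expm

Q, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(3, 3)))
H = Q @ np.diag([2.0, 1.0, 0.0]) @ Q.T   # PSD, rank-deficient "NTK" matrix

U0 = np.array([1.0, -2.0, 0.5])          # initial residual f(theta_0, X) - Y
null_dir = Q[:, 2]                       # eigenvector with eigenvalue 0
stalled = abs(U0 @ null_dir)             # component that cannot decay

for t in [0.0, 1.0, 10.0, 100.0]:
    Ut = expm(-H * t) @ U0
    print(t, np.linalg.norm(Ut), stalled)  # ||U(t)|| converges to `stalled`, not 0
```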

This isn’t a nitpick: authoritative NTK references explicitly relate convergence / memorization (zero loss) to positivity / positive-definiteness of the limiting NTK. Also, the NTK is generally only guaranteed to be positive semi-definite (PSD), which by definition allows zero eigenvalues.

So the statement “can always get to zero loss” is too strong without additional assumptions (e.g., strict positive definiteness of the NTK matrix for the dataset/architecture/activation).

4 sources
Model: OPENAI_GPT_5 Prompt: v1.6.0