www.lesswrong.com/posts/QzpKq92nXqp8NHM34/neural-tangent-kernel-distillation
2 corrections found
[
Y \sim N(0,H) = \frac{e^{-\frac12 Y^T H^{-1} Y}}{\sqrt{2\pi \det H}}
]
This gives the wrong normalization constant for an n-dimensional (multivariate) Gaussian: the denominator uses the one-dimensional factor (\sqrt{2\pi}) where the multivariate density requires ((2\pi)^{n/2}), i.e., the denominator should read (\sqrt{(2\pi)^n \det H}).
Full reasoning
The post defines Y ∈ R^n (so this is an n-dimensional Gaussian). For a multivariate normal with covariance matrix (H), the probability density has normalization
[
\frac{1}{(2\pi)^{n/2}\,|H|^{1/2}}\exp\Big(-\tfrac12 Y^T H^{-1} Y\Big),
]
(or equivalently (1/\sqrt{(2\pi)^n \det(H)})).
But the post’s displayed equation has denominator (\sqrt{2\pi \det H}), which is the 1-dimensional normalizing factor, not the n-dimensional one. This is a concrete mathematical error given the post’s own earlier definition of (Y) as an (n)-vector.
References below give the standard multivariate normal density with the ((2\pi)^{p/2}|\Sigma|^{1/2}) normalization.
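The size of the discrepancy can be checked numerically. The sketch below (a hypothetical example, not from the post) implements both the correct n-dimensional density and the one-dimensionally-normalized version in the post, and confirms that they differ by exactly ((2\pi)^{(n-1)/2}), independent of where the density is evaluated:

```python
import numpy as np

def gaussian_density(Y, H):
    """Correct density of N(0, H) for an n-dimensional vector Y."""
    n = len(Y)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(H))
    return np.exp(-0.5 * Y @ np.linalg.solve(H, Y)) / norm

def post_density(Y, H):
    """Density as displayed in the post: 1-D normalizer sqrt(2*pi*det H)."""
    norm = np.sqrt(2 * np.pi * np.linalg.det(H))
    return np.exp(-0.5 * Y @ np.linalg.solve(H, Y)) / norm

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)   # a random positive-definite covariance
Y = rng.normal(size=n)

# The ratio is a constant (2*pi)^((n-1)/2), not 1, for any n > 1
ratio = post_density(Y, H) / gaussian_density(Y, H)
print(ratio, (2 * np.pi) ** ((n - 1) / 2))
```

For n = 1 the two formulas coincide, which is presumably how the slip arose.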
2 sources
- NIST/SEMATECH e-Handbook of Statistical Methods — 6.5.4.2. The Multivariate Normal Distribution
For a p-dimensional vector X, the density is (1/(2π))^{p/2} |Σ|^{-1/2} exp[-1/2 (X−m)' Σ^{-1} (X−m)].
- Multivariate normal distribution — Wikipedia
The pdf of a multivariate normal is f(x)= 1/sqrt((2π)^k |Σ|) * exp(−1/2 (x−μ)^T Σ^{-1} (x−μ)).
A result that we thought was cool that didn't fit anywhere else in this is the proof that infinitely wide neural networks can always get to zero loss.
Infinite-width NTK-style analyses don’t imply “always” zero training loss: convergence to zero depends on the NTK (kernel/Gram) matrix being strictly positive definite (no zero eigenvalues). The NTK is only guaranteed to be positive semidefinite, so zero-loss convergence is not unconditional.
Full reasoning
In the NTK / kernel-gradient-flow picture, training dynamics on the training set take the form
[
\dot U(t) = -H\,U(t), \quad U(t)=f(\theta(t),X)-Y.
]
The solution is (U(t)=e^{-Ht}U(0)). If (H) has any zero eigenvalues (i.e., (H) is only positive semi-definite, not strictly positive definite), then the corresponding components of (U(t)) do not decay to zero (because (e^{-0\cdot t}=1)). Therefore, zero loss is not guaranteed unless (H) is strictly positive definite on the training data.
This isn’t a nitpick: authoritative NTK references explicitly relate convergence / memorization (zero loss) to positivity / positive-definiteness of the limiting NTK. Also, the NTK is generally only guaranteed to be positive semi-definite (PSD), which by definition allows zero eigenvalues.
So the statement “can always get to zero loss” is too strong without additional assumptions (e.g., strict positive definiteness of the NTK matrix for the dataset/architecture/activation).
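The failure mode is easy to exhibit concretely. The toy example below (an assumed rank-deficient Gram matrix, not taken from the post) evolves the residual (U(t)=e^{-Ht}U(0)) for a PSD (H) with a zero eigenvalue, and shows that a residual lying in the null space of (H) never decays:

```python
import numpy as np

# Toy PSD "NTK Gram matrix" with eigenvalues 1 and 0 (rank-1)
v = np.array([1.0, 1.0]) / np.sqrt(2)
H = np.outer(v, v)

# Residual dynamics U(t) = exp(-H t) U(0), computed via the
# eigendecomposition of the symmetric matrix H
w, Q = np.linalg.eigh(H)
def U(t, U0):
    return Q @ (np.exp(-w * t) * (Q.T @ U0))

# U0 is orthogonal to v, i.e. in the null space of H:
# the e^{-0*t} = 1 mode means this residual never shrinks
U0 = np.array([1.0, -1.0])
print(U(100.0, U0))   # unchanged even after long training
```

Under gradient flow with this (H), the loss plateaus at (\tfrac12\|U_0\|^2 > 0); only when (H) is strictly positive definite does every mode of the residual decay.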
4 sources
- The Positivity of the Neural Tangent Kernel — arXiv
The positivity of the NTK is directly related to the memorization capacity… i.e., to the possibility of reaching zero loss in training, via gradient descent.
- Neural Tangent Kernel: Convergence and Generalization in Neural Networks (Jacot, Gabriel, Hongler) — arXiv
In the infinite-width limit [the NTK] stays constant during training… Convergence of the training can then be related to the positive-definiteness of the limiting NTK.
- Neural tangent kernel — Wikipedia
Since it is written as a dot product… we are guaranteed that the NTK is symmetric and positive semi-definite.
- Positive-definite kernel — Wikipedia
A distinction is sometimes made between positive-definite kernels… and positive semi-definite kernels… [PSD] do not impose this condition… equivalent to requiring… matrices… have… nonnegative (p.s.d.) eigenvalues.