www.lesswrong.com/posts/3ZBmKDpAJJahRM248/proof-idea-slt-to-ait
1 correction found
The total error of our prediction, measured as KL-divergence in bits across D data points, should then be bounded above by C(f, M_1) + D·H(h), where C(f, M_1) is the length of the shortest program that implements f on the UTM M_1 and H(h) is the entropy of our random variable h.
For Solomonoff/Bayes-mixture prediction, the standard bound on cumulative expected KL divergence is an O(K(μ)) term (Kolmogorov complexity / -log prior weight), not an additional term growing linearly with the number of data points like D·H(h).
Full reasoning
The post claims that cumulative prediction error measured as KL-divergence (in bits) across D datapoints is bounded by a complexity term plus an additional linear-in-D term proportional to the entropy of a hidden randomness variable h.
However, in the standard theory of Bayes-mixtures / Solomonoff-style prediction, the quantity corresponding to cumulative expected KL divergence between the true distribution μ and the mixture predictor ξ is bounded by the negative log prior weight of μ inside the mixture (equivalently, by Kolmogorov complexity when using Solomonoff-Levin weights). This bound does not contain an additive term that scales linearly with sample size D.
Concretely, Hutter (2003) defines the cumulative relative entropy (KL divergence) as:
- \(D_n := \mathbb{E}_{1:n}\,\ln \frac{\mu(x_{1:n})}{\xi(x_{1:n})} = \sum_{t=1}^n \mathbb{E}_{<t}\, d_t\),
and then proves (Theorem 1) the bound:
- \(D_n \le \ln w_\mu^{-1}\),
where \(w_\mu\) is μ's mixture weight.
Moreover, with Solomonoff/Occam weights \(w_\nu := 2^{-K(\nu)}\), the same paper explicitly notes that this implies:
- \(D_n \le K(\mu)\cdot \ln 2\),
i.e. in bits, \(D_n/\ln 2 \le K(\mu)\) up to constants.
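As a sanity check on the bound above, here is a small numerical sketch (not from Hutter's paper; the three-hypothesis Bernoulli class, the uniform prior, and the true parameter 0.7 are illustrative assumptions). It computes the cumulative expected KL divergence \(D_n\) exactly, using the count of ones as a sufficient statistic, and confirms that \(D_n\) stays below \(\ln w_\mu^{-1}\) no matter how large n gets:

```python
import math

# Hypothesis class: Bernoulli(p) for a few values of p, uniform prior weights.
ps = [0.3, 0.5, 0.7]
w = {p: 1 / 3 for p in ps}
mu = 0.7  # true parameter; crucially, it is in the class

def D_n(n):
    """Exact cumulative expected KL divergence (in nats) over n rounds."""
    total = 0.0
    for t in range(1, n + 1):
        m = t - 1
        for k in range(m + 1):
            # Probability under mu of seeing k ones in the first m observations.
            pk = math.comb(m, k) * mu**k * (1 - mu) ** (m - k)
            # Mixture predictive probability of a 1 given that history.
            num = sum(w[p] * p**k * (1 - p) ** (m - k) * p for p in ps)
            den = sum(w[p] * p**k * (1 - p) ** (m - k) for p in ps)
            xi1 = num / den
            # Instantaneous KL between mu's and the mixture's next-bit predictions.
            d = mu * math.log(mu / xi1) + (1 - mu) * math.log((1 - mu) / (1 - xi1))
            total += pk * d
    return total

bound = math.log(1 / w[mu])  # ln w_mu^{-1} = ln 3 ≈ 1.0986 nats
print(D_n(10), D_n(100), bound)
```

The cumulative divergence increases with n but never crosses the `bound` line, whereas a bound with a genuine D·H(h) term would grow without limit here.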
This directly contradicts the post's stated bound, which adds \(D\,H(h)\) to a complexity term when the error measure is the KL-divergence between the true distribution and the predictor. Intuitively, KL divergence measures the mismatch between the predicted and true distributions: if the predictor matches the true conditional distribution, the KL divergence is 0 even when the outcomes themselves are noisy. So an additive term proportional to the entropy of the noise and growing with D is not part of the standard KL-divergence regret bound.
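The intuition can be made concrete with a two-line calculation (the value 0.7 is an arbitrary illustrative choice): a predictor that matches the true conditional distribution of a noisy binary outcome incurs zero per-step KL divergence, even though the noise entropy H(h) is large.

```python
import math

p = 0.7  # true probability of a 1 for a noisy binary outcome
q = p    # a predictor that matches the true conditional exactly

# Per-step KL divergence (bits) between truth and prediction.
kl_bits = p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))
# Per-step entropy (bits) of the outcome itself.
entropy_bits = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(kl_bits)        # 0.0: no KL penalty, however noisy the data
print(entropy_bits)   # ≈ 0.881: the noise entropy, which does not enter the KL regret
```

If the post's D·H(h) term were part of the KL bound, this predictor would be charged ≈ 0.881·D bits despite predicting perfectly.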
1 source
- Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet (Marcus Hutter, JMLR 2003)
Defines cumulative relative entropy D_n := E ln(μ(x1:n)/ξ(x1:n)) and states the bound D_n ≤ ln w_μ^{-1} (Theorem 1). Also notes that with weights w_ν := 2^{-K(ν)}, the relative entropy is bounded by Kolmogorov complexity: D_n ≤ K(μ)·ln 2.