www.lesswrong.com/posts/3ZBmKDpAJJahRM248/proof-idea-slt-to-ait
1 correction found
The total error of our prediction, measured as KL-divergence in bits across D data points, should then be bounded above by C(f, M_1) + D·H(h), where C(f, M_1) is the length of the shortest program that implements f on the UTM M_1 and H(h) is the entropy of our random variable h.
For Solomonoff/Bayes-mixture prediction, the standard bound on cumulative expected KL divergence is an O(K(μ)) term (Kolmogorov complexity / -log prior weight), not an additional term growing linearly with the number of data points like D·H(h).
Full reasoning
The post claims that cumulative prediction error measured as KL-divergence (in bits) across D datapoints is bounded by a complexity term plus an additional linear-in-D term proportional to the entropy of a hidden randomness variable h.
However, in the standard theory of Bayes-mixtures / Solomonoff-style prediction, the quantity corresponding to cumulative expected KL divergence between the true distribution μ and the mixture predictor ξ is bounded by the negative log prior weight of μ inside the mixture (equivalently, by Kolmogorov complexity when using Solomonoff-Levin weights). This bound does not contain an additive term that scales linearly with sample size D.
Concretely, Hutter (2003) defines the cumulative relative entropy (KL divergence) as:
- \(D_n := \mathbb{E}_{1:n}\,\ln \frac{\mu(x_{1:n})}{\xi(x_{1:n})} = \sum_{t=1}^n \mathbb{E}_{<t}\, d_t\),
and then proves (Theorem 1) the bound:
- \(D_n \le \ln w_\mu^{-1}\),
where \(w_\mu\) is μ's mixture weight.
Moreover, with Solomonoff/Occam weights \(w_\nu := 2^{-K(\nu)}\), the same paper explicitly notes that this implies:
- \(D_n \le K(\mu)\cdot \ln 2\),
i.e. in bits, \(D_n/\ln 2 \le K(\mu)\) up to constants.
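As a sanity check on the bound above, here is a small numerical sketch (not from Hutter's paper; the three-hypothesis Bernoulli class, the uniform prior, and the true parameter 0.7 are illustrative assumptions). It computes the cumulative expected KL divergence \(D_n\) exactly, using the count of ones as a sufficient statistic, and confirms that \(D_n\) stays below \(\ln w_\mu^{-1}\) no matter how large n gets:

```python
import math

# Hypothesis class: Bernoulli(p) for a few values of p, uniform prior weights.
ps = [0.3, 0.5, 0.7]
w = {p: 1 / 3 for p in ps}
mu = 0.7  # true parameter; crucially, it is in the class

def D_n(n):
    """Exact cumulative expected KL divergence (in nats) over n rounds."""
    total = 0.0
    for t in range(1, n + 1):
        m = t - 1
        for k in range(m + 1):
            # Probability under mu of seeing k ones in the first m observations.
            pk = math.comb(m, k) * mu**k * (1 - mu) ** (m - k)
            # Mixture predictive probability of a 1 given that history.
            num = sum(w[p] * p**k * (1 - p) ** (m - k) * p for p in ps)
            den = sum(w[p] * p**k * (1 - p) ** (m - k) for p in ps)
            xi1 = num / den
            # Instantaneous KL between mu's and the mixture's next-bit predictions.
            d = mu * math.log(mu / xi1) + (1 - mu) * math.log((1 - mu) / (1 - xi1))
            total += pk * d
    return total

bound = math.log(1 / w[mu])  # ln w_mu^{-1} = ln 3 ≈ 1.0986 nats
print(D_n(10), D_n(100), bound)
```

The cumulative divergence increases with n but never crosses the `bound` line, whereas a bound with a genuine D·H(h) term would grow without limit here.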
This directly contradicts the post's stated bound, which adds \(D\,H(h)\) to a complexity term when the error measure is the KL-divergence between the true distribution and the predictor. Intuitively, KL divergence measures the mismatch between the predicted and true distributions: if the predictor matches the true conditional distribution, the KL divergence is 0 even when the outcomes themselves are noisy. So an additive term proportional to the entropy of the noise and growing with D is not part of the standard KL-divergence regret bound.
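The intuition can be made concrete with a two-line calculation (the value 0.7 is an arbitrary illustrative choice): a predictor that matches the true conditional distribution of a noisy binary outcome incurs zero per-step KL divergence, even though the noise entropy H(h) is large.

```python
import math

p = 0.7  # true probability of a 1 for a noisy binary outcome
q = p    # a predictor that matches the true conditional exactly

# Per-step KL divergence (bits) between truth and prediction.
kl_bits = p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))
# Per-step entropy (bits) of the outcome itself.
entropy_bits = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(kl_bits)        # 0.0: no KL penalty, however noisy the data
print(entropy_bits)   # ≈ 0.881: the noise entropy, which does not enter the KL regret
```

If the post's D·H(h) term were part of the KL bound, this predictor would be charged ≈ 0.881·D bits despite predicting perfectly.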
1 source
- Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet (Marcus Hutter, JMLR 2003)
Defines cumulative relative entropy D_n := E ln(μ(x1:n)/ξ(x1:n)) and states the bound D_n ≤ ln w_μ^{-1} (Theorem 1). Also notes that with weights w_ν := 2^{-K(ν)}, the relative entropy is bounded by Kolmogorov complexity: D_n ≤ K(μ)·ln 2.