x.com/dearmadisonblue/status/2043073072505159695
1 correction found
LLM performance = log(compute)
This oversimplifies the scaling evidence: the canonical LLM scaling-law papers report power-law relationships between cross-entropy loss and compute, not a general rule that performance equals log(compute).
Full reasoning
The best-known empirical scaling-law result for language models does not say performance = log(compute). In the original OpenAI scaling-laws paper, Kaplan et al. report that cross-entropy loss scales as a power law with model size, dataset size, and the amount of compute used for training, i.e. roughly L(C) ∝ C^(-α). A power law is a different functional form from a logarithmic one: it appears as a straight line in log-log coordinates, whereas a log rule is straight in semi-log coordinates.
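To make the distinction concrete, here is a minimal sketch of the two functional forms (the constants are illustrative placeholders, not the paper's fitted values): under a power law, each 10x increase in compute multiplies the loss by a constant factor, while under a log rule each 10x adds a constant amount.

```python
import math

# Kaplan-style power law: loss falls as a power of compute,
# L(C) = (c_c / C) ** alpha. Constants are illustrative, not fitted.
def power_law_loss(compute, c_c=1.0, alpha=0.05):
    return (c_c / compute) ** alpha

# A hypothetical "performance = log(compute)" rule, by contrast.
def log_rule(compute, a=1.0, b=0.1):
    return a + b * math.log(compute)

# Each 10x in compute: constant *ratio* under the power law,
# constant *difference* under the log rule.
ratios = [power_law_loss(10 ** (k + 1)) / power_law_loss(10 ** k) for k in range(1, 4)]
diffs = [log_rule(10 ** (k + 1)) - log_rule(10 ** k) for k in range(1, 4)]
print(ratios)  # constant ratio: power law
print(diffs)   # constant difference: log rule
```

The two forms coincide only in the loose sense that both improve slowly with compute; as fitted curves they make different quantitative predictions, which is why the papers' choice of a power law matters.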
A later major scaling paper from DeepMind (the Chinchilla paper) likewise frames scaling in terms of compute-optimal tradeoffs between parameters and training tokens, not a universal log(compute) rule. Its core result is that for compute-optimal training, model size and training tokens should be scaled together, and the paper reports downstream performance gains from that regime.
Separately, the post contrasts this with claims that AI is improving exponentially. METR's current public time-horizon page says its frontier-model data are better fit by an exponential trend over time than by linear or hyperbolic alternatives, so the broader claim that progress is "really logarithmic" is not an accurate description of the current evidence base.
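METR's statement is a claim about goodness of fit. A toy sketch of how such a comparison works, on synthetic exponentially growing data (illustrative numbers only, not METR's measurements): fit a line to the raw values and a line to the log of the values, then compare squared errors in the original space.

```python
import math

# Synthetic "time horizon" series, doubling each step (illustrative only).
years = list(range(6))
horizon = [1.5 * 2 ** t for t in years]

# Ordinary least-squares slope and intercept.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def sse(xs, ys, predict):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))

# Linear fit on raw data vs. exponential fit (a line in log space).
m1, b1 = fit_line(years, horizon)
m2, b2 = fit_line(years, [math.log(h) for h in horizon])

err_linear = sse(years, horizon, lambda t: m1 * t + b1)
err_exp = sse(years, horizon, lambda t: math.exp(m2 * t + b2))
print(err_linear > err_exp)  # exponential fit wins on exponential data
```

On genuinely exponential data the log-space fit is essentially exact, while the linear fit leaves large residuals; METR reports the analogous comparison on its real measurements.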
Because the post states an exact relationship (LLM performance = log(compute)), readers could take it as the established scaling law. The primary literature does not support that formulation.
3 sources
- Scaling Laws for Neural Language Models
Abstract: We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training...
- Training Compute-Optimal Large Language Models
Abstract: ... for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. ... Chinchilla uniformly and significantly outperforms Gopher ... on a large range of downstream evaluation tasks.
- Task-Completion Time Horizons of Frontier AI Models - METR
Why does METR fit an exponential trend ... ? ... an exponential trend is a much better fit to the data than both the linear and hyperbolic trends.