x.com/Altimor/status/2040935359542960254
1 correction found
Apparently an LLM only ever activates a small % of its weights at any given time.
This is not true for LLMs in general. Sparse activation is specific to Mixture-of-Experts models; many mainstream LLMs are dense models that reuse the same parameters for all inputs rather than activating only a small fraction.
Full reasoning
This post overgeneralizes a property of sparse Mixture-of-Experts (MoE) models to LLMs as a whole.
Primary sources distinguish these two cases clearly:
- The Switch Transformers paper says that deep-learning models "typically reuse the same parameters for all inputs" and that MoE is the exception, because it "selects different parameters for each incoming example," yielding a "sparsely-activated model."
- The Mixtral of Experts paper gives a concrete sparse example: each token is routed to only 2 of the experts at every layer, and the model "only uses 13B active parameters" out of the 47B it has access to during inference (a toy sketch of this routing follows this list).
- By contrast, major LLMs such as GPT-3 and Llama 3 are explicitly described in primary sources as non-sparse (dense) language models. The GPT-3 paper describes it as an autoregressive model with "175 billion parameters, 10x more than any previous non-sparse language model," placing GPT-3 itself in the non-sparse category; Meta's Llama 3 paper explicitly says its largest model is a "dense Transformer."
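To make the distinction concrete, here is a minimal sketch in Python/NumPy. It is not code from any of the cited papers: the layer sizes, the `dense_ffn`/`moe_ffn` names, and the toy dimensions are invented for illustration. It shows why a top-2 router touches only 2 of 8 expert weight matrices for a given token, while a dense layer uses every one of its parameters for every input.

```python
import numpy as np

# Illustrative only -- not Mixtral's or any paper's actual implementation.
d_model, d_ff, n_experts, top_k = 16, 64, 8, 2
rng = np.random.default_rng(0)

# Dense layer: one weight matrix, used in full by every token.
dense_w = rng.normal(size=(d_model, d_ff))

# MoE layer: a router plus 8 expert weight matrices.
router_w = rng.normal(size=(d_model, n_experts))
expert_w = rng.normal(size=(n_experts, d_model, d_ff))

def dense_ffn(x):
    # Every parameter in dense_w participates for every token.
    return np.maximum(x @ dense_w, 0.0)

def moe_ffn(x):
    # Router scores each expert; only the top-2 experts run for this token.
    scores = x @ router_w                  # shape (n_experts,)
    chosen = np.argsort(scores)[-top_k:]   # indices of the 2 best-scoring experts
    gates = np.exp(scores[chosen])
    gates = gates / gates.sum()            # softmax over the chosen experts only
    return sum(g * np.maximum(x @ expert_w[e], 0.0)
               for g, e in zip(gates, chosen))

token = rng.normal(size=d_model)
dense_out = dense_ffn(token)   # uses every entry of dense_w
moe_out = moe_ffn(token)       # touches only 2 of the 8 expert matrices

active = top_k * d_model * d_ff + d_model * n_experts
total = n_experts * d_model * d_ff + d_model * n_experts
print(f"MoE layer: {active}/{total} parameters active for this token "
      f"(~{active / total:.0%}); the dense layer always uses 100%.")
```

The toy ratio (roughly a quarter of the expert-layer parameters active per token) is only meant to show the mechanism; the 13B-of-47B figure for Mixtral comes from the source quoted below.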
So the accurate statement is: some LLMs (especially MoE models) activate only a subset of parameters per token, but many LLMs are dense and do not work that way. Because the post presents this as how "an LLM" works in general, it is factually misleading.
4 sources
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model...
- Mixtral of Experts
For every token, at each layer, a router network selects two experts ... As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference.
- The Llama 3 Herd of Models
Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens.
- Language Models are Few-Shot Learners
Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model...