All corrections
X · April 6, 2026 at 09:16 PM

x.com/Altimor/status/2040935359542960254

1 correction found

1
Claim
Apparently an LLM only ever activates a small % of its weights at any given time.
Correction

This is not true for LLMs in general. Sparse activation is specific to Mixture-of-Experts models; many mainstream LLMs are dense models that reuse the same parameters for all inputs rather than activating only a small fraction.

Full reasoning

This post overgeneralizes a property of sparse Mixture-of-Experts (MoE) models to LLMs as a whole.

Primary sources distinguish these two cases clearly:

  • The Switch Transformers paper says that deep-learning models "typically reuse the same parameters for all inputs" and that MoE is the exception, because it "selects different parameters for each incoming example," yielding a "sparsely-activated model."
  • The Mixtral of Experts paper gives a concrete sparse example: each token is routed to only 2 experts, and the model "only uses 13B active parameters" out of a larger total during inference.
  • By contrast, major LLMs such as GPT-3 and Llama 3 are described in their primary sources as dense (non-sparse) models: the GPT-3 paper describes it as a "175 billion parameter" model and explicitly contrasts it with "the previous non-sparse language model," and Meta's Llama 3 paper states that its largest model is a "dense Transformer."

So the accurate statement is: some LLMs (especially MoE models) activate only a subset of their parameters per token, while many others are dense and do not work that way. Because the post presents this as how "an LLM" behaves in general, it is factually misleading.
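
To make the contrast concrete, below is a minimal sketch of top-2 expert routing in plain NumPy. It is not Mixtral's actual implementation, and all sizes are hypothetical; it only illustrates the point above: the router scores every expert, but only the two chosen experts' weight matrices are ever multiplied against the token, so only a fraction of the layer's parameters is "active" for that token.

# Minimal sketch (plain NumPy, hypothetical sizes; not Mixtral's actual code)
# of top-2 expert routing in a sparse MoE feed-forward layer.
import numpy as np

rng = np.random.default_rng(0)

d_model = 16     # hypothetical hidden size, chosen only for illustration
n_experts = 8    # Mixtral-style: 8 experts per MoE layer
top_k = 2        # each token is routed to only 2 of them

# Each "expert" is a tiny feed-forward block: d_model -> 4*d_model -> d_model.
experts = [
    (rng.normal(size=(d_model, 4 * d_model)), rng.normal(size=(4 * d_model, d_model)))
    for _ in range(n_experts)
]
router = rng.normal(size=(d_model, n_experts))  # gating network

def moe_layer(x):
    """Route one token vector through its top-2 experts; the other experts are never touched."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]                           # indices of the 2 highest-scoring experts
    gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # softmax over the chosen experts only
    out = np.zeros_like(x)
    for g, idx in zip(gates, chosen):
        w_in, w_out = experts[idx]
        out += g * (np.maximum(x @ w_in, 0.0) @ w_out)             # only these experts' weights are read
    return out

token = rng.normal(size=d_model)
moe_layer(token)

# "Active" vs. total expert parameters for this toy layer:
per_expert = 2 * d_model * 4 * d_model
print("total expert params :", n_experts * per_expert)
print("active expert params:", top_k * per_expert, f"({top_k}/{n_experts} of the total)")

A dense feed-forward layer, by contrast, multiplies every token by the same full weight matrices, so all of its parameters are used for every input, which is exactly the behavior the Switch Transformers paper describes as the default for deep-learning models.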

4 sources
Model: OPENAI_GPT_5 Prompt: v1.16.0