www.lesswrong.com/posts/a9KqqgjN8gc3Mzzkh/llms-as-giant-lookup-tables-of-shallow...
2 corrections found
resulting in at most 16-20 bits of state to be passed between each circuit.
For a vocabulary of 50k-200k tokens, one token can encode only about 15.6-17.6 bits, not up to 20 bits.
Full reasoning
This is an information-theory/math error. If a state is compressed to one token chosen from a vocabulary of size V, the maximum information that token can carry is log2(V) bits.
Using the vocabulary sizes in the post's own range:
- 50,257 tokens → log2(50,257) ≈ 15.6 bits
- 160,000 tokens → log2(160,000) ≈ 17.3 bits
- 200,019 tokens → log2(200,019) ≈ 17.6 bits
- even 201,088 tokens → log2(201,088) ≈ 17.6 bits
So a 50k-200k-token vocabulary supports roughly 15.6-17.6 bits, not "16-20 bits." Reaching 20 bits would require about 2^20 = 1,048,576 distinct tokens, far above the 50k-200k range cited here.
The qualitative point about a token bottleneck may still stand, but the numeric upper bound in the quoted sentence is too high.
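The arithmetic above can be checked directly with Python's `math.log2`, using the vocabulary sizes cited in the sources:

```python
import math

# Maximum information per token is log2(V) bits for a vocabulary of size V.
for vocab_size in (50_257, 160_000, 200_019, 201_088):
    print(f"{vocab_size:>7} tokens -> {math.log2(vocab_size):.1f} bits")
#   50257 tokens -> 15.6 bits
#  160000 tokens -> 17.3 bits
#  200019 tokens -> 17.6 bits
#  201088 tokens -> 17.6 bits

# Reaching 20 bits would need this many distinct tokens:
print(2 ** 20)  # 1048576
```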
2 sources
- Kimi K2.5 model card
Model Summary: Vocabulary Size 160K.
- OpenAI tiktoken openai_public.py
r50k_base uses "explicit_n_vocab": 50257. The o200k_base tokenizer includes special token ids up to 200018, and o200k_harmony reserves ids through 201087.
Compute : matmul of square matrix with size 12288, at logarithmic circuit depth for matmuls we get 13-14 steps
In GPT-3-style multi-head attention, the attention-score multiplication is not a 12,288×12,288 square matrix multiply.
Full reasoning
This misstates the dimensions used in self-attention.
For GPT-175B / GPT-3-sized models, the hidden size is 12,288 and the number of attention heads is 96. In multi-head attention, the model dimension is split across heads, so each head has dimension 12,288 / 96 = 128, not 12,288.
PyTorch's MultiheadAttention documentation states that embed_dim is split across num_heads, so each head works with dimension embed_dim // num_heads. The attention-score computation per head therefore multiplies a (sequence length, head_dim) query matrix by the transpose of a (sequence length, head_dim) key matrix. For a GPT-3-sized model, the score computation is built from per-head 128-dimensional vectors (and the sequence length), not from a single square 12,288 × 12,288 matrix.
So this line overstates the matrix size used in the self-attention score calculation, which also affects the subsequent depth estimate built on it.
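A minimal sketch of the per-head shapes, assuming GPT-3's hidden size of 12,288 and 96 heads; the sequence length of 2,048 is illustrative:

```python
import numpy as np

d_model, n_heads = 12288, 96   # GPT-3-scale hidden size and head count
seq_len = 2048                 # illustrative sequence length

head_dim = d_model // n_heads  # each head gets 12288 / 96 = 128 dimensions
assert head_dim == 128

# Per-head attention scores: (seq_len, head_dim) @ (head_dim, seq_len).
q = np.zeros((seq_len, head_dim))
k = np.zeros((seq_len, head_dim))
scores = q @ k.T
print(scores.shape)  # (2048, 2048) -- the 128-dim axis is contracted away;
                     # no 12288 x 12288 square matmul appears anywhere
```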
2 sources
- Introducing PyTorch Fully Sharded Data Parallel (FSDP) API - PyTorch
Benchmark table for GPT 175B: Number of layers 96, Hidden size 12288, Attention heads 96.
- MultiheadAttention - PyTorch documentation
"embed_dim will be split across num_heads (i.e. each head will have dimension embed_dim // num_heads)." Query and key embeddings are shaped by sequence length and embedding dimension.