www.lesswrong.com/posts/a9KqqgjN8gc3Mzzkh/llms-as-giant-lookup-tables-of-shallow...
2 corrections found
resulting in at most 16-20 bits of state to be passed between each circuit.
For a vocabulary of 50k-200k tokens, one token can encode only about 15.6-17.6 bits, not up to 20 bits.
Full reasoning
This is an information-theory/math error. If a state is compressed to one token chosen from a vocabulary of size V, the maximum information that token can carry is log2(V) bits.
Using the vocabulary sizes in the post's own range:
- 50,257 tokens → log2(50,257) ≈ 15.6 bits
- 160,000 tokens → log2(160,000) ≈ 17.3 bits
- 200,019 tokens → log2(200,019) ≈ 17.6 bits
- even 201,088 tokens → log2(201,088) ≈ 17.6 bits
So a 50k-200k-token vocabulary supports roughly 15.6-17.6 bits, not "16-20 bits." Reaching 20 bits would require about 2^20 = 1,048,576 distinct tokens, far above the 50k-200k range cited here.
The qualitative point about a token bottleneck may still stand, but the numeric upper bound in the quoted sentence is too high.
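The arithmetic above can be checked directly with Python's `math.log2`, using the vocabulary sizes cited in the sources:

```python
import math

# Maximum information per token is log2(V) bits for a vocabulary of size V.
for vocab_size in (50_257, 160_000, 200_019, 201_088):
    print(f"{vocab_size:>7} tokens -> {math.log2(vocab_size):.1f} bits")
#   50257 tokens -> 15.6 bits
#  160000 tokens -> 17.3 bits
#  200019 tokens -> 17.6 bits
#  201088 tokens -> 17.6 bits

# Reaching 20 bits would need this many distinct tokens:
print(2 ** 20)  # 1048576
```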
2 sources
- Kimi K2.5 model card
Model Summary: Vocabulary Size 160K.
- OpenAI tiktoken openai_public.py
r50k_base uses "explicit_n_vocab": 50257. The o200k_base tokenizer includes special token ids up to 200018, and o200k_harmony reserves ids through 201087.
Compute : matmul of square matrix with size 12288, at logarithmic circuit depth for matmuls we get 13-14 steps
In GPT-3-style multi-head attention, the attention-score multiplication is not a 12,288×12,288 square matrix multiply.
Full reasoning
This misstates the dimensions used in self-attention.
For GPT-175B / GPT-3-sized models, the hidden size is 12,288 and the number of attention heads is 96. In multi-head attention, the model dimension is split across heads, so each head has dimension 12,288 / 96 = 128, not 12,288.
PyTorch's MultiheadAttention documentation states that embed_dim is split across num_heads, so each head works with dimension embed_dim // num_heads. The attention-score computation per head therefore multiplies a (sequence length, head_dim) query matrix by the transpose of a (sequence length, head_dim) key matrix. For a GPT-3-sized model, the score computation is built from per-head 128-dimensional vectors (and the sequence length), not from a single square 12,288 × 12,288 matrix.
So this line overstates the matrix size used in the self-attention score calculation, which also affects the subsequent depth estimate built on it.
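A minimal sketch of the per-head shapes, assuming GPT-3's hidden size of 12,288 and 96 heads; the sequence length of 2,048 is illustrative:

```python
import numpy as np

d_model, n_heads = 12288, 96   # GPT-3-scale hidden size and head count
seq_len = 2048                 # illustrative sequence length

head_dim = d_model // n_heads  # each head gets 12288 / 96 = 128 dimensions
assert head_dim == 128

# Per-head attention scores: (seq_len, head_dim) @ (head_dim, seq_len).
q = np.zeros((seq_len, head_dim))
k = np.zeros((seq_len, head_dim))
scores = q @ k.T
print(scores.shape)  # (2048, 2048) -- the 128-dim axis is contracted away;
                     # no 12288 x 12288 square matmul appears anywhere
```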
2 sources
- Introducing PyTorch Fully Sharded Data Parallel (FSDP) API - PyTorch
Benchmark table for GPT 175B: Number of layers 96, Hidden size 12288, Attention heads 96.
- MultiheadAttention - PyTorch documentation
"embed_dim will be split across num_heads (i.e. each head will have dimension embed_dim // num_heads)." Query and key embeddings are shaped by sequence length and embedding dimension.