All corrections
1
Claim
resulting in at most 16-20 bits of state to be passed between each circuit.
Correction

For a vocabulary of 50k-200k tokens, one token can encode only about 15.6-17.6 bits, not up to 20 bits.

Full reasoning

This is an information-theory/math error. If a state is compressed to one token chosen from a vocabulary of size V, the maximum information that token can carry is log2(V) bits.

Using the vocabulary sizes in the post's own range:

  • 50,257 tokens → log2(50,257) ≈ 15.6 bits
  • 160,000 tokens → log2(160,000) ≈ 17.3 bits
  • 200,019 tokens → log2(200,019) ≈ 17.6 bits
  • even 201,088 tokens → log2(201,088) ≈ 17.6 bits

So a 50k-200k-token vocabulary supports roughly 15.6-17.6 bits, not "16-20 bits." Reaching 20 bits would require about 2^20 = 1,048,576 distinct tokens—far above the 50k-200k range cited here.
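The arithmetic above can be checked with a few lines of Python (vocabulary sizes taken from the list above; nothing else is assumed):

```python
from math import log2

# Maximum information a single token can carry is log2(V) bits,
# where V is the vocabulary size.
for vocab_size in [50_257, 160_000, 200_019, 201_088]:
    print(f"{vocab_size:>7} tokens -> {log2(vocab_size):.1f} bits")

# Vocabulary size needed to reach a full 20 bits per token:
print(2 ** 20)  # 1048576
```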

The qualitative point about a token bottleneck may still stand, but the numeric upper bound in the quoted sentence is too high.

2
Claim
Compute : matmul of square matrix with size 12288, at logarithmic circuit depth for matmuls we get 13-14 steps
Correction

In GPT-3-style multi-head attention, the attention-score multiplication is not a 12,288×12,288 square matrix multiply.

Full reasoning

This misstates the dimensions used in self-attention.

For GPT-175B / GPT-3-sized models, the hidden size is 12,288 and the number of attention heads is 96. In multi-head attention, the model dimension is split across heads, so each head has dimension 12,288 / 96 = 128, not 12,288.

PyTorch's MultiheadAttention docs state that embed_dim is split across num_heads. Accordingly, the per-head query and key matrices have shape (sequence length, head_dim), and the attention-score computation multiplies those per-head matrices. For a GPT-3-sized model, the core score computation therefore operates on per-head 128-dimensional vectors (scaled by sequence length), not on a single square 12,288 × 12,288 matrix.
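A minimal sketch of the per-head shapes, assuming the GPT-3 figures cited above (hidden size 12,288, 96 heads) and GPT-3's 2,048-token context length for illustration:

```python
# Per-head attention-score shapes for a GPT-3-sized model.
hidden_size = 12_288
num_heads = 96
seq_len = 2_048  # GPT-3's context length, used here for illustration

head_dim = hidden_size // num_heads   # 128, not 12,288
q_shape = (seq_len, head_dim)         # per-head queries
k_shape = (seq_len, head_dim)         # per-head keys
scores_shape = (seq_len, seq_len)     # per-head Q @ K^T

print(head_dim)      # 128
print(scores_shape)  # (2048, 2048)
```

So the score matrix itself is (sequence length × sequence length) per head, and none of the multiplied operands is a 12,288 × 12,288 square matrix.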

So this line overstates the matrix size used in the self-attention score calculation, which also affects the subsequent depth estimate built on it.

Model: OPENAI_GPT_5 Prompt: v1.16.0