All corrections
Substack April 9, 2026 at 03:49 AM

newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4

2 corrections found

1
Claim
In Gemma 3, this interleaving was generally in a 4:1 pattern with 4 layers of local attention followed by a single layer of global attention.
Correction

Gemma 3’s published architecture used a 5:1 local-to-global attention pattern, not 4:1.

Full reasoning

Google’s own Gemma 3 documentation describes the architecture as five local sliding-window attention layers followed by one global attention layer. The Google Developers Blog says Gemma 3 is composed of repeating blocks containing “5 local attention layers … and 1 global attention layer.” Hugging Face’s Gemma 3 documentation says the same thing: “alternating 5 local sliding window self-attention layers for every global self-attention layer.” So the article’s description of Gemma 3 as “4:1” is off by one local layer.
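
To make the pattern concrete, here is a small Python sketch of how a 5:1 local-to-global interleave lays out across layers. The 5-plus-1 block size comes from the sources above; the 30-layer depth is illustrative only, not any specific model's configuration.

LOCALS_PER_GLOBAL = 5          # 5 sliding-window layers per global layer
BLOCK = LOCALS_PER_GLOBAL + 1  # repeating block of 6 layers

def layer_attention_types(num_layers: int) -> list[str]:
    """Return 'local' or 'global' for each layer index."""
    return [
        "global" if (i + 1) % BLOCK == 0 else "local"
        for i in range(num_layers)
    ]

types = layer_attention_types(30)  # illustrative depth
print(types.count("local"), "local vs", types.count("global"), "global")
# -> 25 local vs 5 global, a 5:1 ratio; a 4:1 pattern would give 24 vs 6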

2 sources
2
Claim
These embeddings are quite a bit smaller (256 versus 1536 dimensions in E2B and 2056 in E4B) than the original lookup table to save storage.
Correction

The E4B hidden size is 2560, not 2056.

Full reasoning

The official Gemma 4 E4B config sets hidden_size to 2560. The article's E2B value of 1536 matches the official config, but its E4B figure looks like a digit-order typo: it says 2056 where the config says 2560.
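
As a quick sanity check on these widths, the following Python sketch compares a 256-dimension per-layer embedding table against lookup tables at the full hidden sizes. The vocabulary size is an assumed placeholder used only to make the numbers concrete; the ratios do not depend on it.

VOCAB_SIZE = 262_144  # assumed placeholder, illustrative only
PLE_DIM = 256         # per-layer embedding width quoted in the article

for model, hidden_size in [("E2B", 1536), ("E4B", 2560)]:
    full_table = VOCAB_SIZE * hidden_size   # hidden-size-wide lookup table
    small_table = VOCAB_SIZE * PLE_DIM      # 256-dimension table
    print(f"{model}: {small_table / full_table:.1%} of the full table, "
          f"{hidden_size // PLE_DIM}x smaller")
# E2B: 16.7% of the full table, 6x smaller
# E4B: 10.0% of the full table, 10x smaller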

2 sources
Model: OPENAI_GPT_5 Prompt: v1.16.0