newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4
2 corrections found
In Gemma 3, this interleaving was generally in a 4:1 pattern with 4 layers of local attention followed by a single layer of global attention.
Gemma 3’s published architecture used a 5:1 local-to-global attention pattern, not 4:1.
Full reasoning
Google’s own Gemma 3 documentation describes the architecture as five local sliding-window attention layers followed by one global attention layer. The Google Developers Blog says Gemma 3 is composed of repeating blocks containing “5 local attention layers … and 1 global attention layer.” Hugging Face’s Gemma 3 documentation says the same thing: “alternating 5 local sliding window self-attention layers for every global self-attention layer.” So the article’s description of Gemma 3 as “4:1” is off by one local layer.
2 sources
- Gemma explained: What’s new in Gemma 3 - Google Developers Blog
The updated model architecture is composed of repeating interleaving blocks, each containing 5 local attention layers with a sliding window of 1024 and 1 global attention layer.
- Gemma 3 - Hugging Face Transformers documentation
The key differences are alternating 5 local sliding window self-attention layers for every global self-attention layer.
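For concreteness, here is a minimal sketch of what the corrected 5:1 interleaving looks like as a layer schedule. The helper below is hypothetical (not Google's implementation); it only encodes the pattern both sources describe: repeating blocks of five sliding-window layers followed by one global layer.

```python
# Hypothetical helper illustrating the 5:1 local/global pattern.
# Each repeating block of six layers = 5 sliding-window + 1 global.

def layer_types(num_layers: int, locals_per_global: int = 5) -> list[str]:
    """Return the attention type for each layer index."""
    block_len = locals_per_global + 1  # 5 local + 1 global = block of 6
    return [
        "global" if (i + 1) % block_len == 0 else "local_sliding_window"
        for i in range(num_layers)
    ]

# Example: a 12-layer stack yields two full 5:1 blocks,
# with global attention at layers 6 and 12 (indices 5 and 11).
print(layer_types(12))
```

Under the article's "4:1" description, the global layers would instead land at every fifth position, which is the one-layer discrepancy the sources rule out.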
These embeddings are quite a bit smaller (256 versus 1536 dimensions in E2B and 2056 in E4B) than the original lookup table to save storage.
The E4B hidden size is 2560, not 2056.
Full reasoning
The official Gemma 4 E4B config sets hidden_size to 2560. The article's E2B value of 1536 matches its official config, but the E4B figure looks like a digit-scrambled typo: it says 2056 where the config says 2560.
2 sources
- config.json · google/gemma-4-E4B-it
"hidden_size": 2560, "hidden_size_per_layer_input": 256
- config.json · google/gemma-4-E2B-it
"hidden_size": 1536, "hidden_size_per_layer_input": 256