newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4
2 corrections found
In Gemma 3, this interleaving was generally in a 4:1 pattern with 4 layers of local attention followed by a single layer of global attention.
Gemma 3’s published architecture used a 5:1 local-to-global attention pattern, not 4:1.
Full reasoning
Google’s own Gemma 3 documentation describes the architecture as five local sliding-window attention layers followed by one global attention layer. The Google Developers Blog says Gemma 3 is composed of repeating blocks containing “5 local attention layers … and 1 global attention layer.” Hugging Face’s Gemma 3 documentation says the same thing: “alternating 5 local sliding window self-attention layers for every global self-attention layer.” So the article’s description of Gemma 3 as “4:1” is off by one local layer.
2 sources
- Gemma explained: What’s new in Gemma 3 - Google Developers Blog
The updated model architecture is composed of repeating interleaving blocks, each containing 5 local attention layers with a sliding window of 1024 and 1 global attention layer.
- Gemma 3 - Hugging Face Transformers documentation
The key differences are alternating 5 local sliding window self-attention layers for every global self-attention layer.
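For concreteness, here is a minimal sketch of what the corrected 5:1 interleaving looks like as a layer schedule. The helper below is hypothetical (not Google's implementation); it only encodes the pattern both sources describe: repeating blocks of five sliding-window layers followed by one global layer.

```python
# Hypothetical helper illustrating the 5:1 local/global pattern.
# Each repeating block of six layers = 5 sliding-window + 1 global.

def layer_types(num_layers: int, locals_per_global: int = 5) -> list[str]:
    """Return the attention type for each layer index."""
    block_len = locals_per_global + 1  # 5 local + 1 global = block of 6
    return [
        "global" if (i + 1) % block_len == 0 else "local_sliding_window"
        for i in range(num_layers)
    ]

# Example: a 12-layer stack yields two full 5:1 blocks,
# with global attention at layers 6 and 12 (indices 5 and 11).
print(layer_types(12))
```

Under the article's "4:1" description, the global layers would instead land at every fifth position, which is the one-layer discrepancy the sources rule out.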
These embeddings are quite a bit smaller (256 versus 1536 dimensions in E2B and 2056 in E4B) than the original lookup table to save storage.
The E4B hidden size is 2560, not 2056.
Full reasoning
The official Gemma 4 E4B config sets hidden_size to 2560. The article's E2B value of 1536 matches its official config, but the E4B figure looks like a digit-scrambled typo: it says 2056 where the config says 2560.
2 sources
- config.json · google/gemma-4-E4B-it
"hidden_size": 2560, "hidden_size_per_layer_input": 256
- config.json · google/gemma-4-E2B-it
"hidden_size": 1536, "hidden_size_per_layer_input": 256