Google DeepMind Releases Quantization-Aware Training Versions of Gemma 4 Models in GGUF Format

TL;DR

Google DeepMind has released quantization-aware training (QAT) optimized versions of its Gemma 4 model family in GGUF Q4_0 format. The QAT versions preserve similar quality to bfloat16 while dramatically reducing memory requirements, with models available across the entire Gemma 4 lineup: E2B, E4B, 12B, 26B A4B, and 31B.

3 min read
0

Google DeepMind Releases Quantization-Aware Training Versions of Gemma 4 Models in GGUF Format

Google DeepMind has released quantization-aware training (QAT) optimized versions of its Gemma 4 model family in GGUF Q4_0 format, designed to preserve similar quality to bfloat16 while dramatically reducing memory requirements.

Four QAT Checkpoint Formats Available

The release includes four distinct QAT checkpoint types:

  • Unquantized QAT checkpoints (Q4_0): Half-precision weights extracted from the QAT pipeline for custom downstream compilation and research. Available for all Gemma 4 models: E2B, E4B, 12B, 26B A4B, 31B, and their drafter models.

  • GGUF (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility, available for the same model lineup.

  • Mobile-optimized (wNa8o8): Custom schema with targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available only for Gemma 4 E2B and E4B.

  • Compressed Tensors (w4a16): QAT checkpoints serialized in compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B, and 31B.

Gemma 4 Family Specifications

The Gemma 4 family consists of multimodal models handling text and image input, with audio support on E2B, E4B, and 12B variants. Context windows extend to 128K tokens for smaller models (E2B, E4B) and 256K tokens for medium and large models (12B, 26B A4B, 31B).

Dense Models:

  • E2B: 2.3B effective parameters (5.1B with embeddings), 35 layers, 512-token sliding window
  • E4B: 4.5B effective parameters (8B with embeddings), 42 layers, 512-token sliding window
  • 12B Unified: 11.95B parameters, 48 layers, 1024-token sliding window, encoder-free architecture
  • 31B Dense: 30.7B parameters, 60 layers, 1024-token sliding window

Mixture-of-Experts Model:

  • 26B A4B MoE: 25.2B total parameters, 3.8B active parameters, 30 layers, 8 active experts out of 128 total plus 1 shared

All models use a 262K vocabulary size and support multilingual capabilities across over 140 languages.

Benchmark Performance

According to Google DeepMind, the instruction-tuned versions achieve the following scores on key benchmarks:

  • Gemma 4 31B: 85.2% MMLU Pro, 89.2% AIME 2026 (no tools), 80.0% LiveCodeBench v6
  • Gemma 4 26B A4B: 82.6% MMLU Pro, 88.3% AIME 2026, 77.1% LiveCodeBench v6
  • Gemma 4 12B Unified: 77.2% MMLU Pro, 77.5% AIME 2026, 72.0% LiveCodeBench v6
  • Gemma 4 E4B: 69.4% MMLU Pro, 42.5% AIME 2026, 52.0% LiveCodeBench v6
  • Gemma 4 E2B: 60.0% MMLU Pro, 37.5% AIME 2026, 44.0% LiveCodeBench v6

The models also demonstrate vision capabilities with scores ranging from 44.2% (E2B) to 76.9% (31B) on Vision MMMU Pro, and audio processing capabilities on E2B, E4B, and 12B variants.

Technical Architecture

The models employ a hybrid attention mechanism interleaving local sliding window attention with full global attention. Global layers feature unified Keys and Values and apply Proportional RoPE (p-RoPE) to optimize memory for long contexts.

The E2B and E4B models use Per-Layer Embeddings (PLE) for parameter efficiency in on-device deployments. The 12B Unified variant eliminates dedicated encoders, projecting raw image patches and audio waveforms directly into the LLM's embedding space through lightweight linear layers.

What This Means

The QAT-optimized GGUF releases make the Gemma 4 family significantly more accessible for deployment on resource-constrained hardware. By maintaining near-bfloat16 quality at Q4_0 quantization, these models can run on consumer GPUs, laptops, and even mobile devices while preserving competitive benchmark performance. The mobile-optimized wNa8o8 format with 2-bit decoding layers for the smallest models (E2B, E4B) specifically targets edge deployment scenarios where VRAM is severely limited. For developers using vLLM, the compressed-tensors format provides an optimized inference path without requiring custom quantization workflows.

Related Articles

Comments

Loading...