Google DeepMind Releases Quantization-Aware Training Versions of Gemma 4 Models in GGUF Format
Google DeepMind has released quantization-aware training (QAT) optimized versions of its Gemma 4 model family in GGUF Q4_0 format. The QAT versions preserve quality close to that of the bfloat16 originals while sharply reducing memory requirements, and they are available across the entire Gemma 4 lineup: E2B, E4B, 12B, 26B A4B, and 31B.
Four QAT Checkpoint Formats Available
The release includes four distinct QAT checkpoint types:
- Unquantized QAT checkpoints (Q4_0): Half-precision weights extracted from the QAT pipeline for custom downstream compilation and research. Available for all Gemma 4 models: E2B, E4B, 12B, 26B A4B, 31B, and their drafter models.
- GGUF (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility, available for the same model lineup.
- Mobile-optimized (wNa8o8): Custom schema with targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available only for Gemma 4 E2B and E4B.
- Compressed Tensors (w4a16): QAT checkpoints serialized in compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B, and 31B.
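The availability rules above are easy to misread, so here is a minimal sketch that encodes them as a lookup table. The format keys and the `formats_for` helper are illustrative names, not identifiers from the release, and the drafter models are left out for brevity.

```python
# Availability as listed in the article. Keys and helper names are
# illustrative, not official identifiers; drafter models are omitted.
ALL_MODELS = ["E2B", "E4B", "12B", "26B A4B", "31B"]

FORMAT_AVAILABILITY = {
    "unquantized-qat": ALL_MODELS,
    "gguf-q4_0": ALL_MODELS,
    "mobile-wNa8o8": ["E2B", "E4B"],
    "compressed-tensors-w4a16": ["E2B", "E4B", "12B", "31B"],
}


def formats_for(model: str) -> list:
    """Return the checkpoint formats the article lists for a given model."""
    if model not in ALL_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return [fmt for fmt, models in FORMAT_AVAILABILITY.items() if model in models]


if __name__ == "__main__":
    for name in ALL_MODELS:
        print(f"{name}: {', '.join(formats_for(name))}")
```

Read this way, the 26B A4B model is the only one listed with neither a mobile nor a compressed-tensors build.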
Gemma 4 Family Specifications
The Gemma 4 family consists of multimodal models handling text and image input, with audio support on E2B, E4B, and 12B variants. Context windows extend to 128K tokens for smaller models (E2B, E4B) and 256K tokens for medium and large models (12B, 26B A4B, 31B).
Dense Models:
- E2B: 2.3B effective parameters (5.1B with embeddings), 35 layers, 512-token sliding window
- E4B: 4.5B effective parameters (8B with embeddings), 42 layers, 512-token sliding window
- 12B Unified: 11.95B parameters, 48 layers, 1024-token sliding window, encoder-free architecture
- 31B Dense: 30.7B parameters, 60 layers, 1024-token sliding window
Mixture-of-Experts Model:
- 26B A4B MoE: 25.2B total parameters, 3.8B active parameters, 30 layers, 8 active experts out of 128 total plus 1 shared
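To illustrate what "8 active experts out of 128 total plus 1 shared" means per token, the following toy sketch routes a single scalar input. The expert counts come from the article. The gating function (softmax over the selected logits) and the way the shared expert is added are assumptions, since the article does not describe Gemma 4's actual router.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts (from the article)
TOP_K = 8          # routed experts active per token (from the article)


def route_token(router_logits, top_k=TOP_K):
    """Select the top_k experts and return {expert_index: gate_weight}.

    Assumption: gate weights are a softmax over the selected logits only.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    max_logit = max(router_logits[i] for i in top)  # for numerical stability
    exps = {i: math.exp(router_logits[i] - max_logit) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_output(x, router_logits, experts, shared_expert):
    """Shared expert output plus the gate-weighted sum of the routed experts.

    Assumption: the shared expert always runs and is added unweighted.
    """
    gates = route_token(router_logits)
    routed = sum(weight * experts[i](x) for i, weight in gates.items())
    return shared_expert(x) + routed


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    # Toy experts: each one scales its input by a different factor.
    experts = [lambda x, s=i / NUM_EXPERTS: s * x for i in range(NUM_EXPERTS)]
    print(sorted(route_token(logits)))
    print(moe_output(1.0, logits, experts, shared_expert=lambda x: x))
```

Only the selected experts are evaluated for a given token, which is why the model's active parameter count (3.8B) is far below its total (25.2B).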
All models use a 262K vocabulary size and support multilingual capabilities across over 140 languages.
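The parameter counts above give a rough sense of the memory savings. The sketch below compares weight storage in bfloat16 (2 bytes per weight) with GGUF Q4_0, where each block of 32 weights stores 4-bit values plus one 16-bit scale, or 4.5 bits per weight. It assumes every tensor is stored in Q4_0 and ignores the KV cache and activations. Real GGUF files often keep some tensors at higher precision, so actual file sizes will differ.

```python
# Q4_0 block layout as used by GGUF/ggml: 32 weights share one fp16 scale.
Q4_0_WEIGHTS_PER_BLOCK = 32
Q4_0_BYTES_PER_BLOCK = 16 + 2  # 32 four-bit values + one 2-byte scale
Q4_0_BYTES_PER_WEIGHT = Q4_0_BYTES_PER_BLOCK / Q4_0_WEIGHTS_PER_BLOCK  # 0.5625
BF16_BYTES_PER_WEIGHT = 2.0

# Total parameter counts in billions, taken from the article
# (E2B and E4B use the "with embeddings" figures).
TOTAL_PARAMS_B = {
    "E2B": 5.1,
    "E4B": 8.0,
    "12B Unified": 11.95,
    "26B A4B": 25.2,
    "31B Dense": 30.7,
}


def weights_gb(params_billions, bytes_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_weight


if __name__ == "__main__":
    for name, params in TOTAL_PARAMS_B.items():
        bf16 = weights_gb(params, BF16_BYTES_PER_WEIGHT)
        q4 = weights_gb(params, Q4_0_BYTES_PER_WEIGHT)
        print(f"{name:12s} bf16 ~{bf16:5.1f} GB   Q4_0 ~{q4:5.1f} GB")
```

Under these assumptions, Q4_0 weights take roughly 3.6 times less space than bfloat16 for every model in the family.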
Benchmark Performance
According to Google DeepMind, the instruction-tuned versions achieve the following scores on key benchmarks:
- Gemma 4 31B: 85.2% MMLU Pro, 89.2% AIME 2026 (no tools), 80.0% LiveCodeBench v6
- Gemma 4 26B A4B: 82.6% MMLU Pro, 88.3% AIME 2026, 77.1% LiveCodeBench v6
- Gemma 4 12B Unified: 77.2% MMLU Pro, 77.5% AIME 2026, 72.0% LiveCodeBench v6
- Gemma 4 E4B: 69.4% MMLU Pro, 42.5% AIME 2026, 52.0% LiveCodeBench v6
- Gemma 4 E2B: 60.0% MMLU Pro, 37.5% AIME 2026, 44.0% LiveCodeBench v6
On the MMMU Pro vision benchmark, scores range from 44.2% (E2B) to 76.9% (31B). Audio processing is available on the E2B, E4B, and 12B variants.
Technical Architecture
The models employ a hybrid attention mechanism interleaving local sliding window attention with full global attention. Global layers feature unified Keys and Values and apply Proportional RoPE (p-RoPE) to optimize memory for long contexts.
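The difference between local and global layers can be sketched with attention masks. The example below builds a causal mask with an optional sliding window and shows how many key/value positions each kind of layer has to keep in its cache. The convention that the window includes the current token is an assumption, and the article does not state how local and global layers are interleaved.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    window=None gives full (global) causal attention. Otherwise each query
    sees only the most recent `window` positions, itself included
    (an assumed convention).
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def cached_positions(context_len: int, window: Optional[int] = None) -> int:
    """Key/value positions a layer must keep to decode the next token."""
    return context_len if window is None else min(context_len, window)


if __name__ == "__main__":
    for row in causal_mask(6, window=3):
        print("".join("x" if allowed else "." for allowed in row))
    # A 1024-token local layer versus a global layer at a long context.
    print(cached_positions(200_000, window=1024), cached_positions(200_000))
```

Local layers keep a cache bounded by the window size regardless of context length, so the global layers dominate memory use at long contexts.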
The E2B and E4B models use Per-Layer Embeddings (PLE) for parameter efficiency in on-device deployments. The 12B Unified variant eliminates dedicated encoders, projecting raw image patches and audio waveforms directly into the LLM's embedding space through lightweight linear layers.
What This Means
The QAT-optimized GGUF releases make the Gemma 4 family significantly more accessible for deployment on resource-constrained hardware. By maintaining near-bfloat16 quality at Q4_0 quantization, these models can run on consumer GPUs, laptops, and even mobile devices while preserving competitive benchmark performance. The mobile-optimized wNa8o8 format with 2-bit decoding layers for the smallest models (E2B, E4B) specifically targets edge deployment scenarios where VRAM is severely limited. For developers using vLLM, the compressed-tensors format provides an optimized inference path without requiring custom quantization workflows.