Google DeepMind Releases Quantization-Aware Training Versions of Gemma 4 Models in GGUF Format

TL;DR

Google DeepMind has released quantization-aware training (QAT) optimized versions of its Gemma 4 model family in GGUF Q4_0 format. The QAT versions keep quality close to that of the bfloat16 checkpoints while sharply reducing memory requirements, and they cover the entire Gemma 4 lineup: E2B, E4B, 12B, 26B A4B, and 31B.

June 9, 2026 · 12:36 AM · 3 min read


Google DeepMind has released quantization-aware training (QAT) optimized versions of its Gemma 4 model family in GGUF Q4_0 format, designed to keep quality close to that of the bfloat16 checkpoints while sharply reducing memory requirements.

Four QAT Checkpoint Formats Available

The release includes four distinct QAT checkpoint types:

Unquantized QAT checkpoints (Q4_0): Half-precision weights extracted from the QAT pipeline for custom downstream compilation and research. Available for all Gemma 4 models: E2B, E4B, 12B, 26B A4B, 31B, and their drafter models.
GGUF (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility, available for the same model lineup.
Mobile-optimized (wNa8o8): Custom schema with targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available only for Gemma 4 E2B and E4B.
Compressed Tensors (w4a16): QAT checkpoints serialized in compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B, and 31B.
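
For readers unfamiliar with the Q4_0 label used in the formats above, the following is a minimal sketch of Q4_0-style block quantization: weights are grouped into blocks of 32, and each block stores one scale plus 32 four-bit codes. It is an illustration only, not the code Google DeepMind or llama.cpp ships. The function names are hypothetical, the scale is kept as a Python float rather than a half-precision value, and the packing of two codes per byte is omitted.

```python
BLOCK_SIZE = 32  # Q4_0 groups weights into blocks of 32 values


def quantize_block(weights):
    """Quantize one block of floats to 4-bit codes (0..15) plus one scale."""
    assert len(weights) == BLOCK_SIZE
    # Use the largest-magnitude value (sign kept) so that it maps to code 0.
    extreme = max(weights, key=abs)
    scale = extreme / -8.0
    inv = 1.0 / scale if scale != 0 else 0.0
    codes = [min(15, max(0, int(w * inv + 8.5))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Map 4-bit codes back to approximate float weights."""
    return [(c - 8) * scale for c in codes]


def bits_per_weight(scale_bits=16, code_bits=4):
    """Storage cost per weight: one 16-bit scale shared by 32 four-bit codes."""
    return (scale_bits + BLOCK_SIZE * code_bits) / BLOCK_SIZE


if __name__ == "__main__":
    block = [((i * 7) % 32 - 16) / 10.0 for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale}, worst-case error={worst:.3f}")
    print(f"bits per weight: {bits_per_weight()}")
```

Quantization-aware training does not change this storage layout. As described in the article, the point of QAT is that the model is trained so that rounding of this kind costs little quality.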

Gemma 4 Family Specifications

The Gemma 4 family consists of multimodal models handling text and image input, with audio support on E2B, E4B, and 12B variants. Context windows extend to 128K tokens for smaller models (E2B, E4B) and 256K tokens for medium and large models (12B, 26B A4B, 31B).

Dense Models:

E2B: 2.3B effective parameters (5.1B with embeddings), 35 layers, 512-token sliding window
E4B: 4.5B effective parameters (8B with embeddings), 42 layers, 512-token sliding window
12B Unified: 11.95B parameters, 48 layers, 1024-token sliding window, encoder-free architecture
31B Dense: 30.7B parameters, 60 layers, 1024-token sliding window

Mixture-of-Experts Model:

26B A4B MoE: 25.2B total parameters, 3.8B active parameters, 30 layers, 8 active experts out of 128 total plus 1 shared
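
The "8 active experts out of 128" figure can be pictured with a small top-k routing sketch. This is an assumption-laden illustration: the article does not describe how the Gemma 4 router scores or weights experts, so the softmax over the selected logits and the name `route_token` are hypothetical. The shared expert is not routed at all in this sketch because it would run for every token.

```python
import math

NUM_EXPERTS = 128  # routed experts, from the article
TOP_K = 8          # experts active per token, from the article


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and normalise their weights.

    Assumption: weights are a softmax over the selected logits only.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


if __name__ == "__main__":
    logits = [0.01 * i for i in range(NUM_EXPERTS)]
    for expert, weight in route_token(logits):
        print(f"expert {expert}: weight {weight:.3f}")
```

Because only the selected experts run for each token, compute per token tracks the 3.8B active parameters, while all 25.2B parameters still have to be held in memory.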

All models use a 262K vocabulary size and support multilingual capabilities across over 140 languages.
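
To put the memory claim in rough numbers, the sketch below estimates weight storage from the parameter counts listed above, comparing bfloat16 (16 bits per weight) with Q4_0 (4.5 bits per weight, from one 16-bit scale per block of 32 four-bit values). These are back-of-the-envelope figures under stated assumptions, not published file sizes: the estimate assumes every tensor is stored at the same precision, uses the "with embeddings" totals for E2B and E4B, and ignores KV cache and activation memory.

```python
# Parameter counts in billions, taken from the specifications above.
PARAMS_BILLIONS = {
    "E2B": 5.1,       # with embeddings
    "E4B": 8.0,       # with embeddings
    "12B": 11.95,
    "26B A4B": 25.2,  # total, not active, parameters
    "31B": 30.7,
}

BF16_BITS = 16.0
Q4_0_BITS = 4.5  # (16-bit scale + 32 * 4-bit codes) / 32 weights


def weight_gib(params_billions, bits_per_weight):
    """Approximate weight storage in GiB at a uniform precision."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    for name, params in PARAMS_BILLIONS.items():
        bf16 = weight_gib(params, BF16_BITS)
        q4 = weight_gib(params, Q4_0_BITS)
        print(f"{name:8s} bf16 ~{bf16:5.1f} GiB   Q4_0 ~{q4:5.1f} GiB")
```

Under these assumptions the ratio is the same for every model, 16 / 4.5, or about 3.6x smaller weights.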

Benchmark Performance

According to Google DeepMind, the instruction-tuned versions achieve the following scores on key benchmarks:

Gemma 4 31B: 85.2% MMLU Pro, 89.2% AIME 2026 (no tools), 80.0% LiveCodeBench v6
Gemma 4 26B A4B: 82.6% MMLU Pro, 88.3% AIME 2026, 77.1% LiveCodeBench v6
Gemma 4 12B Unified: 77.2% MMLU Pro, 77.5% AIME 2026, 72.0% LiveCodeBench v6
Gemma 4 E4B: 69.4% MMLU Pro, 42.5% AIME 2026, 52.0% LiveCodeBench v6
Gemma 4 E2B: 60.0% MMLU Pro, 37.5% AIME 2026, 44.0% LiveCodeBench v6

The models also demonstrate vision capabilities with scores ranging from 44.2% (E2B) to 76.9% (31B) on Vision MMMU Pro, and audio processing capabilities on E2B, E4B, and 12B variants.

Technical Architecture

The models employ a hybrid attention mechanism interleaving local sliding window attention with full global attention. Global layers feature unified Keys and Values and apply Proportional RoPE (p-RoPE) to optimize memory for long contexts.
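
The difference between local and global layers can be sketched with attention masks. In this minimal illustration, a local layer lets each token attend only to the most recent `window` positions, while a global layer attends to the full causal history. The interleaving pattern is an assumption: the article does not say how many local layers sit between global ones, so `global_every` is a hypothetical parameter, and p-RoPE and the unified Keys and Values are not modelled.

```python
def causal_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full (global) causal attention; an integer limits
    each query to the last `window` positions, itself included.
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def layer_windows(num_layers, window, global_every):
    """Hypothetical schedule: every `global_every`-th layer is global."""
    return [
        None if (i + 1) % global_every == 0 else window
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    # Tiny sizes for readability; the article lists 512- or 1024-token windows.
    for row in causal_mask(6, window=3):
        print("".join("x" if allowed else "." for allowed in row))
    print(layer_windows(num_layers=6, window=512, global_every=3))
```

The practical effect is that a local layer only needs keys and values for the last `window` tokens, so its cache stops growing with context length, while the global layers carry the long-range information.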

The E2B and E4B models use Per-Layer Embeddings (PLE) for parameter efficiency in on-device deployments. The 12B Unified variant eliminates dedicated encoders, projecting raw image patches and audio waveforms directly into the LLM's embedding space through lightweight linear layers.

What This Means

The QAT-optimized GGUF releases make the Gemma 4 family significantly more accessible for deployment on resource-constrained hardware. By maintaining near-bfloat16 quality at Q4_0 quantization, these models can run on consumer GPUs, laptops, and even mobile devices while preserving competitive benchmark performance. The mobile-optimized wNa8o8 format with 2-bit decoding layers for the smallest models (E2B, E4B) specifically targets edge deployment scenarios where VRAM is severely limited. For developers using vLLM, the compressed-tensors format provides an optimized inference path without requiring custom quantization workflows.

Source: huggingface.co


