Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters
Google DeepMind has released Gemma 4, a family of open-weight multimodal models including a 31B dense model with 256K context window and four size variants ranging from 2.3B to 30.7B effective parameters. The release includes Multi-Token Prediction (MTP) draft models that achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.
Model lineup and specifications
Gemma 4 includes four model sizes across dense and Mixture-of-Experts (MoE) architectures:
Dense models:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context, 35 layers
- E4B: 4.5B effective parameters (8B with embeddings), 128K context, 42 layers
- 31B: 30.7B parameters, 256K context, 60 layers
MoE model:
- 26B A4B: 25.2B total parameters, 3.8B active parameters, 256K context, 30 layers with 8 active experts out of 128 total plus 1 shared expert
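For readers unfamiliar with this routing scheme, the sketch below shows a generic top-k MoE feed-forward block matching the shape described above (8 routed experts chosen from 128, plus one always-on shared expert). It illustrates the general technique only; it is not DeepMind's implementation, and all hidden dimensions are assumed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Generic top-k routed MoE feed-forward block with one shared expert (illustrative dims)."""

    def __init__(self, hidden: int = 2048, ffn: int = 8192, n_experts: int = 128, k: int = 8):
        super().__init__()
        self.k = k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        make_expert = lambda: nn.Sequential(
            nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden)
        )
        self.experts = nn.ModuleList(make_expert() for _ in range(n_experts))
        self.shared = make_expert()  # shared expert runs for every token

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden)
        weights = F.softmax(self.router(x), dim=-1)
        top_w, top_i = weights.topk(self.k, dim=-1)       # pick 8 of 128 experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)   # renormalize routing weights
        out = self.shared(x)
        for slot in range(self.k):
            for e in top_i[:, slot].unique():
                sel = top_i[:, slot] == e                 # tokens routed to expert e in this slot
                out[sel] = out[sel] + top_w[sel, slot].unsqueeze(-1) * self.experts[int(e)](x[sel])
        return out
```

Only the router, the 8 selected experts, and the shared expert run per token, which is why a 25.2B-parameter model can have roughly the per-token compute of a 4B dense model.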
All models use a 262K-token vocabulary and support multilingual text in more than 140 languages. The E2B and E4B models also include native audio processing via an audio encoder of roughly 300M parameters.
Multi-Token Prediction drafters
The key innovation in this release is the set of MTP assistant models. According to Google DeepMind, these smaller draft models predict several tokens ahead, and the target model verifies those drafts in parallel during speculative decoding. Because verification only accepts tokens the target model would have produced itself, the approach delivers up to 2x speedup while guaranteeing output identical to standard generation.
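The article does not include usage code, but if the drafters plug into the standard assisted-generation path in Hugging Face transformers, pairing a target model with its MTP drafter might look like the sketch below. Both model IDs are placeholders, since the article does not name the exact Hub repositories.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_ID = "google/gemma-4-31b-it"           # placeholder: actual repo name may differ
DRAFTER_ID = "google/gemma-4-31b-it-drafter"  # placeholder: actual repo name may differ

tokenizer = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(
    TARGET_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
drafter = AutoModelForCausalLM.from_pretrained(
    DRAFTER_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Explain speculative decoding in one paragraph.", return_tensors="pt"
).to(target.device)

# The drafter proposes a short run of tokens; the target checks the whole run in a
# single forward pass and keeps the longest accepted prefix, so the final output
# matches what the target alone would have produced.
output = target.generate(**inputs, assistant_model=drafter, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```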
Benchmark performance
Google DeepMind reports the following scores for instruction-tuned models:
Gemma 4 31B:
- MMLU Pro: 85.2%
- AIME 2026 (no tools): 89.2%
- LiveCodeBench v6: 80.0%
- Codeforces Elo: 2150
- GPQA Diamond: 84.3%
- Vision MMMU Pro: 76.9%
- MATH-Vision: 85.6%
Gemma 4 26B A4B (MoE):
- MMLU Pro: 82.6%
- AIME 2026: 88.3%
- LiveCodeBench v6: 77.1%
- Codeforces Elo: 1718
For comparison, the previous Gemma 3 27B (without thinking mode) scored 67.6% on MMLU Pro and 20.8% on AIME 2026.
Architecture details
The models employ a hybrid attention mechanism that interleaves local sliding window attention (512 tokens for E2B/E4B, 1024 tokens for larger models) with full global attention. The final layer always uses global attention. Global layers feature unified Keys and Values with Proportional RoPE (p-RoPE) to optimize memory for long contexts.
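To make the interleaving concrete, here is a minimal sketch of how per-layer attention masks could be scheduled. The 5:1 local-to-global ratio is an assumption (the article only specifies the window sizes and the final-layer rule), and real implementations would use fused sliding-window kernels rather than dense boolean masks.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask where each position attends to at most `window` positions, itself included."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (i - j < window)

def global_causal_mask(seq_len: int) -> torch.Tensor:
    """Full causal mask used by the global-attention layers."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return j <= i

def mask_for_layer(layer_idx: int, num_layers: int, seq_len: int, window: int = 1024) -> torch.Tensor:
    # Assumed schedule: every sixth layer is global, and the final layer is always
    # global (only the final-layer rule is stated in the article).
    is_global = layer_idx == num_layers - 1 or layer_idx % 6 == 5
    return global_causal_mask(seq_len) if is_global else sliding_window_mask(seq_len, window)
```

The payoff of this pattern is that most layers keep a KV cache bounded by the window size, while the occasional global layers preserve long-range information flow across the full context.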
The E2B and E4B models use Per-Layer Embeddings (PLE), giving each decoder layer its own small embedding table for every token. This design maximizes parameter efficiency for on-device deployment.
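A minimal sketch of what an additive PLE lookup could look like in PyTorch follows. The dimensions and the additive injection point are assumptions for illustration; the article does not detail the exact mechanism.

```python
import torch
import torch.nn as nn

class PLEDecoderLayer(nn.Module):
    """Decoder layer that owns a small per-token embedding table (illustrative dims)."""

    def __init__(self, vocab_size: int = 262_144, hidden: int = 2048, ple_dim: int = 256):
        super().__init__()
        self.ple = nn.Embedding(vocab_size, ple_dim)          # per-layer table, kept narrow
        self.ple_proj = nn.Linear(ple_dim, hidden, bias=False)
        # self-attention and MLP sublayers omitted for brevity

    def forward(self, hidden_states: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # Add the layer-specific embedding of each input token to the incoming
        # hidden state; the usual attention/MLP sublayers would run on the result.
        return hidden_states + self.ple_proj(self.ple(input_ids))
```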
Multimodal capabilities
All Gemma 4 models handle text and image input with variable aspect ratios and resolutions. Vision encoders range from approximately 150M parameters (E2B/E4B) to 550M parameters (26B A4B/31B). The E2B and E4B models additionally process video frame sequences and native audio input.
According to Google DeepMind, capabilities include object detection, document/PDF parsing, OCR across multiple languages, handwriting recognition, chart comprehension, automatic speech recognition, and speech-to-text translation.
Availability
The models are released under the Apache 2.0 license and are available now on Hugging Face. Integration requires the transformers, torch, and accelerate libraries. Google DeepMind designed the smaller models specifically for local execution on laptops and mobile devices, while the larger models target consumer GPUs and workstations.
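As a rough illustration, loading one of the smaller instruction-tuned variants through the transformers pipeline API might look like the sketch below. The model ID is a placeholder, since the article does not name the exact Hub repositories.

```python
# pip install -U transformers torch accelerate
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-4-e4b-it",  # placeholder: check the actual repo name on the Hub
    device_map="auto",
)
result = pipe("Summarize the Gemma 4 release in two sentences.", max_new_tokens=128)
print(result[0]["generated_text"])
```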
What this means
Gemma 4's combination of speculative decoding drafters and a range of model sizes directly addresses the inference-speed and deployment-flexibility gaps in open-weight models. The 2x speedup claim, if it holds up in practice, makes these models competitive with proprietary offerings for latency-sensitive applications. The MoE architecture in the 26B A4B model is particularly notable: by activating only 3.8B of its 25.2B total parameters per token, it could deliver near-31B quality at near-4B inference cost, a meaningful advance for resource-constrained deployments.