Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters
Google DeepMind has released Gemma 4, a family of open-weight multimodal models including a 31B dense model with 256K context window and four size variants ranging from 2.3B to 30.7B effective parameters. The release includes Multi-Token Prediction (MTP) draft models that achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.
Model lineup and specifications
Gemma 4 includes four model sizes across dense and Mixture-of-Experts (MoE) architectures:
Dense models:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context, 35 layers
- E4B: 4.5B effective parameters (8B with embeddings), 128K context, 42 layers
- 31B: 30.7B parameters, 256K context, 60 layers
MoE model:
- 26B A4B: 25.2B total parameters, 3.8B active parameters, 256K context, 30 layers with 8 active experts out of 128 total plus 1 shared expert
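For readers unfamiliar with this routing scheme, the sketch below shows a generic top-k MoE feed-forward block matching the shape described above (8 routed experts chosen from 128, plus one always-on shared expert). It illustrates the general technique only; it is not DeepMind's implementation, and all hidden dimensions are assumed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Generic top-k routed MoE feed-forward block with one shared expert (illustrative dims)."""

    def __init__(self, hidden: int = 2048, ffn: int = 8192, n_experts: int = 128, k: int = 8):
        super().__init__()
        self.k = k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        make_expert = lambda: nn.Sequential(
            nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden)
        )
        self.experts = nn.ModuleList(make_expert() for _ in range(n_experts))
        self.shared = make_expert()  # shared expert runs for every token

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden)
        weights = F.softmax(self.router(x), dim=-1)
        top_w, top_i = weights.topk(self.k, dim=-1)       # pick 8 of 128 experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)   # renormalize routing weights
        out = self.shared(x)
        for slot in range(self.k):
            for e in top_i[:, slot].unique():
                sel = top_i[:, slot] == e                 # tokens routed to expert e in this slot
                out[sel] = out[sel] + top_w[sel, slot].unsqueeze(-1) * self.experts[int(e)](x[sel])
        return out
```

Only the router, the 8 selected experts, and the shared expert run per token, which is why a 25.2B-parameter model can have roughly the per-token compute of a 4B dense model.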
All models use a 262K-token vocabulary and support multilingual text in more than 140 languages. The E2B and E4B models also include native audio processing via an audio encoder of roughly 300M parameters.
Multi-Token Prediction drafters
The key innovation in this release is the set of MTP assistant models. According to Google DeepMind, these smaller draft models predict several tokens ahead, and the target model verifies those drafts in parallel during speculative decoding. Because verification only accepts tokens the target model would have produced itself, the approach delivers up to 2x speedup while guaranteeing output identical to standard generation.
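The article does not include usage code, but if the drafters plug into the standard assisted-generation path in Hugging Face transformers, pairing a target model with its MTP drafter might look like the sketch below. Both model IDs are placeholders, since the article does not name the exact Hub repositories.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_ID = "google/gemma-4-31b-it"           # placeholder: actual repo name may differ
DRAFTER_ID = "google/gemma-4-31b-it-drafter"  # placeholder: actual repo name may differ

tokenizer = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(
    TARGET_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
drafter = AutoModelForCausalLM.from_pretrained(
    DRAFTER_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Explain speculative decoding in one paragraph.", return_tensors="pt"
).to(target.device)

# The drafter proposes a short run of tokens; the target checks the whole run in a
# single forward pass and keeps the longest accepted prefix, so the final output
# matches what the target alone would have produced.
output = target.generate(**inputs, assistant_model=drafter, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```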
Benchmark performance
Google DeepMind reports the following scores for instruction-tuned models:
Gemma 4 31B:
- MMLU Pro: 85.2%
- AIME 2026 (no tools): 89.2%
- LiveCodeBench v6: 80.0%
- Codeforces Elo: 2150
- GPQA Diamond: 84.3%
- Vision MMMU Pro: 76.9%
- MATH-Vision: 85.6%
Gemma 4 26B A4B (MoE):
- MMLU Pro: 82.6%
- AIME 2026: 88.3%
- LiveCodeBench v6: 77.1%
- Codeforces Elo: 1718
For comparison, the previous Gemma 3 27B (without thinking mode) scored 67.6% on MMLU Pro and 20.8% on AIME 2026.
Architecture details
The models employ a hybrid attention mechanism that interleaves local sliding window attention (512 tokens for E2B/E4B, 1024 tokens for larger models) with full global attention. The final layer always uses global attention. Global layers feature unified Keys and Values with Proportional RoPE (p-RoPE) to optimize memory for long contexts.
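To make the interleaving concrete, here is a minimal sketch of how per-layer attention masks could be scheduled. The 5:1 local-to-global ratio is an assumption (the article only specifies the window sizes and the final-layer rule), and real implementations would use fused sliding-window kernels rather than dense boolean masks.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask where each position attends to at most `window` positions, itself included."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (i - j < window)

def global_causal_mask(seq_len: int) -> torch.Tensor:
    """Full causal mask used by the global-attention layers."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return j <= i

def mask_for_layer(layer_idx: int, num_layers: int, seq_len: int, window: int = 1024) -> torch.Tensor:
    # Assumed schedule: every sixth layer is global, and the final layer is always
    # global (only the final-layer rule is stated in the article).
    is_global = layer_idx == num_layers - 1 or layer_idx % 6 == 5
    return global_causal_mask(seq_len) if is_global else sliding_window_mask(seq_len, window)
```

The payoff of this pattern is that most layers keep a KV cache bounded by the window size, while the occasional global layers preserve long-range information flow across the full context.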
The E2B and E4B models use Per-Layer Embeddings (PLE), giving each decoder layer its own small embedding table for every token. This design maximizes parameter efficiency for on-device deployment.
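A minimal sketch of what an additive PLE lookup could look like in PyTorch follows. The dimensions and the additive injection point are assumptions for illustration; the article does not detail the exact mechanism.

```python
import torch
import torch.nn as nn

class PLEDecoderLayer(nn.Module):
    """Decoder layer that owns a small per-token embedding table (illustrative dims)."""

    def __init__(self, vocab_size: int = 262_144, hidden: int = 2048, ple_dim: int = 256):
        super().__init__()
        self.ple = nn.Embedding(vocab_size, ple_dim)          # per-layer table, kept narrow
        self.ple_proj = nn.Linear(ple_dim, hidden, bias=False)
        # self-attention and MLP sublayers omitted for brevity

    def forward(self, hidden_states: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # Add the layer-specific embedding of each input token to the incoming
        # hidden state; the usual attention/MLP sublayers would run on the result.
        return hidden_states + self.ple_proj(self.ple(input_ids))
```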
Multimodal capabilities
All Gemma 4 models handle text and image input with variable aspect ratios and resolutions. Vision encoders range from approximately 150M parameters (E2B/E4B) to 550M parameters (26B A4B/31B). The E2B and E4B models additionally process video frame sequences and native audio input.
According to Google DeepMind, capabilities include object detection, document/PDF parsing, OCR across multiple languages, handwriting recognition, chart comprehension, automatic speech recognition, and speech-to-text translation.
Availability
The models are released under the Apache 2.0 license and are available now on Hugging Face. Integration requires the transformers, torch, and accelerate libraries. Google DeepMind designed the smaller models specifically for local execution on laptops and mobile devices, while the larger models target consumer GPUs and workstations.
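As a rough illustration, loading one of the smaller instruction-tuned variants through the transformers pipeline API might look like the sketch below. The model ID is a placeholder, since the article does not name the exact Hub repositories.

```python
# pip install -U transformers torch accelerate
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-4-e4b-it",  # placeholder: check the actual repo name on the Hub
    device_map="auto",
)
result = pipe("Summarize the Gemma 4 release in two sentences.", max_new_tokens=128)
print(result[0]["generated_text"])
```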
What this means
Gemma 4's combination of speculative decoding drafters and a range of model sizes directly addresses the inference-speed and deployment-flexibility gaps in open-weight models. The 2x speedup claim, if it holds up in practice, makes these models competitive with proprietary offerings for latency-sensitive applications. The MoE architecture in the 26B A4B model is particularly notable: by activating only 3.8B of its 25.2B total parameters per token, it could deliver near-31B quality at near-4B inference cost, a meaningful advance for resource-constrained deployments.