Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters

TL;DR

Google DeepMind has released Gemma 4, a family of open-weight multimodal models in four sizes, from the E2B (2.3B effective parameters) to a 31B dense model (30.7B parameters) with a 256K context window. The release includes Multi-Token Prediction (MTP) draft models that, according to Google DeepMind, achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.

Google DeepMind has released Gemma 4, a family of open-weight multimodal models featuring a 31B dense model with a 256K context window and Multi-Token Prediction (MTP) draft models that deliver up to 2x inference speedup through speculative decoding.

Model lineup and specifications

Gemma 4 includes four model sizes across dense and Mixture-of-Experts (MoE) architectures:

Dense models:

  • E2B: 2.3B effective parameters (5.1B with embeddings), 128K context, 35 layers
  • E4B: 4.5B effective parameters (8B with embeddings), 128K context, 42 layers
  • 31B: 30.7B parameters, 256K context, 60 layers

MoE model:

  • 26B A4B: 25.2B total parameters, 3.8B active parameters, 256K context, 30 layers; 8 of 128 experts are active per token, plus 1 shared expert
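
To make the "8 of 128 experts plus 1 shared expert" figure concrete, here is a minimal sketch of top-k expert routing for a single token. The expert counts and parameter totals come from the list above; the random router logits, the `route_token` helper, and the choice to softmax-normalise weights over only the selected experts are illustrative assumptions, not details confirmed for Gemma 4.

```python
import math
import random
from typing import List, Tuple

NUM_EXPERTS = 128  # routed experts per MoE layer (from the article)
TOP_K = 8          # routed experts active per token (from the article)


def route_token(router_logits: List[float], top_k: int = TOP_K) -> List[Tuple[int, float]]:
    """Pick the top_k experts for one token and softmax their logits.

    Normalising over only the selected experts is an assumption made
    for this sketch; real routers differ in how they weight experts.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router output for one token (random stand-in values).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route_token(logits)
experts_run = len(chosen) + 1      # 8 routed experts + 1 shared expert
active_fraction = 3.8 / 25.2       # active / total parameters, as reported
```

Only the chosen experts (plus the shared one) run for that token, which is why the active parameter count is roughly 15% of the total even though every expert's weights must remain loaded.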

All models use a 262K-token vocabulary and support multilingual text across 140+ languages. The E2B and E4B models include native audio processing capabilities with approximately 300M audio encoder parameters.

Multi-Token Prediction drafters

The most notable addition in this release is the set of MTP assistant models. According to Google DeepMind, these smaller draft models predict multiple tokens ahead, which the target model then verifies in parallel during speculative decoding. Because the target model keeps only the drafted tokens it agrees with and supplies its own token at the first disagreement, output quality matches standard generation. Google DeepMind reports up to 2x speedup from this approach.
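
The draft-then-verify loop can be illustrated with a small, self-contained sketch of greedy speculative decoding. This is not Google DeepMind's implementation: `toy_target` and `toy_draft` are hypothetical stand-in functions that map a token prefix to the next token, and the verification step loops over positions for readability, whereas a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its next greedy token.
Model = Callable[[List[int]], int]


def greedy_generate(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_generate(target: Model, draft: Model, prompt: List[int],
                         n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position in order.
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] != expected:
                # 3. First mismatch: keep the accepted prefix, then the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every proposal was accepted; the target contributes one extra token.
            tokens.extend(proposal)
            tokens.append(target(tokens))
    return tokens[:goal]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical deterministic 'large' model."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical 'small' model that disagrees with the target on some prefixes."""
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


baseline = greedy_generate(toy_target, [1, 2, 3], 20)
speculative = speculative_generate(toy_target, toy_draft, [1, 2, 3], 20)
```

In this greedy setting every emitted token is one the target would have chosen for the same prefix, so `speculative` and `baseline` are the same sequence. The speed benefit in practice depends on how often the draft's proposals are accepted.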

Benchmark performance

Google DeepMind reports the following scores for instruction-tuned models:

Gemma 4 31B:

  • MMLU Pro: 85.2%
  • AIME 2026 (no tools): 89.2%
  • LiveCodeBench v6: 80.0%
  • Codeforces ELO: 2150
  • GPQA Diamond: 84.3%
  • Vision MMMU Pro: 76.9%
  • MATH-Vision: 85.6%

Gemma 4 26B A4B (MoE):

  • MMLU Pro: 82.6%
  • AIME 2026: 88.3%
  • LiveCodeBench v6: 77.1%
  • Codeforces ELO: 1718

For comparison, the previous Gemma 3 27B (without thinking mode) scored 67.6% on MMLU Pro and 20.8% on AIME 2026.
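
For readers who want the gaps spelled out, the short snippet below computes percentage-point differences from the scores listed above. It uses only the reported figures. Note that the conditions are not identical across rows (the 31B AIME score is marked "no tools" and the Gemma 3 figures are without thinking mode), so the differences are indicative rather than strictly like-for-like.

```python
from typing import Dict

# Scores as reported in the article (percent); nothing here is measured independently.
gemma4_31b = {"MMLU Pro": 85.2, "AIME 2026": 89.2, "LiveCodeBench v6": 80.0}
gemma4_26b_a4b = {"MMLU Pro": 82.6, "AIME 2026": 88.3, "LiveCodeBench v6": 77.1}
gemma3_27b = {"MMLU Pro": 67.6, "AIME 2026": 20.8}


def point_gap(newer: Dict[str, float], older: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point difference on every benchmark both models report."""
    return {name: round(newer[name] - older[name], 1)
            for name in newer if name in older}


gain_over_gemma3 = point_gap(gemma4_31b, gemma3_27b)
dense_vs_moe = point_gap(gemma4_31b, gemma4_26b_a4b)
```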

Architecture details

The models employ a hybrid attention mechanism that interleaves local sliding window attention (512 tokens for E2B/E4B, 1024 tokens for larger models) with full global attention. The final layer always uses global attention. Global layers feature unified Keys and Values with Proportional RoPE (p-RoPE) to optimize memory for long contexts.
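
The difference between local and global layers comes down to which earlier positions each token may attend to. The sketch below builds both kinds of causal mask and lays out a layer pattern. The window sizes and the "final layer is global" rule come from the paragraph above; the interleaving ratio is not stated in the article, so the `global_every` parameter is a purely hypothetical placeholder.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full (global) causal attention; an integer restricts
    each query to the most recent `window` positions, itself included.
    """
    return [[k <= q and (window is None or q - k < window)
             for k in range(seq_len)]
            for q in range(seq_len)]


def layer_kinds(num_layers: int, global_every: int) -> List[str]:
    """Interleave local and global layers.

    `global_every` is an assumed ratio for illustration only; the one rule
    taken from the article is that the final layer is always global.
    """
    kinds = ["global" if (i + 1) % global_every == 0 else "local"
             for i in range(num_layers)]
    kinds[-1] = "global"
    return kinds


local = causal_mask(seq_len=8, window=3)    # tiny window so the pattern is visible
full = causal_mask(seq_len=8)
pattern = layer_kinds(num_layers=60, global_every=6)  # 60 layers as in the 31B model
```

With a sliding window, the key/value entries a local layer needs stay bounded by the window size regardless of context length, so only the global layers grow with the full 128K or 256K context.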

The E2B and E4B models use Per-Layer Embeddings (PLE), giving each decoder layer its own small embedding table for every token. This design maximizes parameter efficiency for on-device deployment.
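
As a rough illustration of the idea, the sketch below gives each layer its own small lookup table and counts the parameters such tables would need. The class name, the tiny dimensions, and the `ple_dim` parameter are hypothetical; the article does not state the per-layer embedding width or how the looked-up vector is combined with the layer's hidden state, so that step is left out.

```python
import random
from typing import List


def ple_parameter_count(num_layers: int, vocab_size: int, ple_dim: int) -> int:
    """Parameters held in per-layer tables: one ple_dim vector per token per layer."""
    return num_layers * vocab_size * ple_dim


class PerLayerEmbeddings:
    """Toy per-layer embedding tables (illustrative only, not Gemma 4's layout)."""

    def __init__(self, num_layers: int, vocab_size: int, ple_dim: int, seed: int = 0):
        rng = random.Random(seed)
        # tables[layer][token_id] -> small embedding vector
        self.tables: List[List[List[float]]] = [
            [[rng.uniform(-1.0, 1.0) for _ in range(ple_dim)]
             for _ in range(vocab_size)]
            for _ in range(num_layers)
        ]

    def lookup(self, layer: int, token_id: int) -> List[float]:
        """Return the vector that `layer` associates with `token_id`."""
        return self.tables[layer][token_id]


ple = PerLayerEmbeddings(num_layers=3, vocab_size=10, ple_dim=4)
vec_layer0 = ple.lookup(0, 7)   # same token ...
vec_layer2 = ple.lookup(2, 7)   # ... different vector in a different layer

# Gap between "with embeddings" and "effective" for E2B, in billions (from the article).
embedding_gap_billion = round(5.1 - 2.3, 1)
```

Because these tables are accessed by lookup rather than multiplied against every token, they add to the stored parameter count without adding much per-token compute, which is consistent with the article's split between "effective" parameters and parameters "with embeddings".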

Multimodal capabilities

All Gemma 4 models handle text and image input with variable aspect ratios and resolutions. Vision encoders range from approximately 150M parameters (E2B/E4B) to 550M parameters (26B A4B/31B). The E2B and E4B models additionally process video frame sequences and native audio input.

According to Google DeepMind, capabilities include object detection, document/PDF parsing, OCR across multiple languages, handwriting recognition, chart comprehension, automatic speech recognition, and speech-to-translated-text translation.

Availability

The models are released under the Apache 2.0 license and are available now on Hugging Face. Integration requires the transformers, torch, and accelerate libraries. Google DeepMind designed the smaller models specifically for local execution on laptops and mobile devices, while the larger models target consumer GPUs and workstations.

What this means

Gemma 4's combination of speculative decoding drafters and a range of model sizes targets two common gaps in open-weight models: inference speed and deployment flexibility. The 2x speedup claim, if validated in practice, would make these models more competitive with proprietary offerings for latency-sensitive applications. The MoE architecture in the 26B A4B model is particularly notable: it activates only 3.8B of its 25.2B parameters per token, and its reported scores trail the 31B dense model by only a few points on MMLU Pro (82.6% vs. 85.2%) and AIME 2026 (88.3% vs. 89.2%), though the Codeforces gap is wider (1718 vs. 2150). That points to near-31B quality on several benchmarks at per-token compute closer to that of a 4B model, although all 25.2B parameters still have to be held in memory. For compute-constrained deployments, that is a meaningful advance.
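
Whether a speedup near 2x shows up in practice depends mainly on how often the drafter's tokens are accepted and how cheap the drafter is to run. The sketch below applies the usual simplified analysis of speculative decoding, which assumes each drafted token is accepted independently with the same probability. The parameter names `alpha` (acceptance rate), `gamma` (tokens drafted per step), and `cost_ratio` (drafter cost relative to one target pass) are assumptions for illustration; none of these values is reported in the article.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target verification pass.

    Under the independent-acceptance assumption this is
    1 + alpha + alpha^2 + ... + alpha^gamma.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain decoding.

    Each step costs gamma drafter passes (each `cost_ratio` of a target pass)
    plus one target pass, and yields expected_tokens_per_step tokens.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


# Hypothetical operating points, chosen only to show the sensitivity.
good_drafter = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.1)
poor_drafter = expected_speedup(alpha=0.3, gamma=4, cost_ratio=0.1)
```

Under these assumed numbers, a drafter with a high acceptance rate clears 2x while one with a low acceptance rate barely breaks even, which is why independent measurements on real workloads matter for the headline claim.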
