Google DeepMind Releases Gemma 4 26B A4B Assistant Model for 2x Faster Inference via Multi-Token Prediction
Google DeepMind has released a Multi-Token Prediction assistant model for Gemma 4 26B A4B that achieves up to 2x decoding speedup through speculative decoding. The underlying base model activates 3.8B of its 25.2B total parameters per token, using a 128-expert MoE architecture with a 256K-token context window.
Gemma 4 26B A4B Assistant — Quick Specs
- Base model: 25.2B total parameters, 3.8B active (MoE, 128 routed experts + 1 shared)
- Context window: 256K tokens
- Claimed speedup: up to 2x via speculative decoding
- License: Apache 2.0, available on Hugging Face
Google DeepMind has released a Multi-Token Prediction (MTP) drafter model for Gemma 4 26B A4B, designed to accelerate inference through speculative decoding. According to Google, the assistant model achieves up to 2x speedup while maintaining identical output quality to standard generation.
Technical Architecture
The Gemma 4 26B A4B base model uses a Mixture-of-Experts architecture with 25.2B total parameters but only 3.8B active parameters during inference. The model features:
- 30 layers interleaving 1024-token sliding-window attention with global attention
- 8 active experts selected per token from 128 routed experts, plus 1 shared expert (see the routing sketch after this list)
- 256K token context window
- 262K vocabulary size
- ~550M parameter vision encoder for multimodal capabilities
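For intuition, here is a minimal sketch of top-k expert routing with the counts reported above (8 of 128 routed experts plus an always-on shared expert). The hidden size, gating scheme, and weight shapes are illustrative assumptions, not Gemma 4's actual implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative MoE router: top-8 of 128 routed experts plus 1 shared
# expert, matching the counts reported for Gemma 4 26B A4B. The hidden
# size and gating details here are assumptions for the sketch.
NUM_EXPERTS, TOP_K, HIDDEN = 128, 8, 2048

def route(hidden_states: torch.Tensor, router_weight: torch.Tensor):
    """Return per-token expert indices and normalized gate weights."""
    logits = hidden_states @ router_weight          # [tokens, 128]
    top_vals, top_idx = logits.topk(TOP_K, dim=-1)  # pick 8 experts/token
    gates = F.softmax(top_vals, dim=-1)             # renormalize over top-8
    return top_idx, gates

tokens = torch.randn(4, HIDDEN)                     # 4 example tokens
router = torch.randn(HIDDEN, NUM_EXPERTS) * 0.02
idx, gates = route(tokens, router)
# Each token's output = shared_expert(x) + sum_k gates[k] * expert[idx[k]](x);
# only 8 of 128 expert MLPs run per token, which is why just 3.8B of the
# 25.2B parameters are active during inference.
print(idx.shape, gates.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```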
The MTP assistant is a smaller, faster draft model that runs ahead of the base model, predicting several tokens at a time. The target model then verifies these predictions in parallel, enabling the speedup without sacrificing quality.
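Conceptually, each speculative decoding round drafts a few tokens cheaply and verifies them all with one target forward pass, keeping the longest agreed prefix. The sketch below is a greedy-verification toy under assumed model interfaces; production implementations use speculative sampling so outputs match the target's distribution exactly.

```python
import torch

def speculative_step(target, drafter, input_ids, k=4):
    """One draft-and-verify round (greedy variant for clarity).

    `target` and `drafter` are stand-ins for model forward passes:
    callables mapping token ids [B, L] to logits [B, L, vocab].
    Real systems use speculative *sampling* so the output distribution
    matches the target model exactly.
    """
    n_prompt = input_ids.shape[1]
    draft = input_ids
    for _ in range(k):  # cheap autoregressive drafting, k small steps
        next_tok = drafter(draft)[:, -1:].argmax(-1)
        draft = torch.cat([draft, next_tok], dim=-1)

    # A single target forward pass scores all k drafted tokens at once.
    tgt_logits = target(draft)
    tgt_pred = tgt_logits[:, n_prompt - 1:-1].argmax(-1)  # [B, k]
    proposed = draft[:, n_prompt:]                        # [B, k]

    # Keep the longest prefix where the target agrees with the drafter,
    # then append the target's own token at the first disagreement.
    # (A full implementation also emits a bonus token when all k match.)
    matches = (tgt_pred == proposed)[0].long()
    n_accept = int(matches.cumprod(0).sum())
    accepted = proposed[:, :n_accept]
    correction = tgt_pred[:, n_accept:n_accept + 1]  # empty if all accepted
    return torch.cat([input_ids, accepted, correction], dim=-1)

# Toy demo with random-logit stand-ins; real use pairs the Gemma 4
# target with the MTP assistant (see Availability below).
vocab = 256
fake = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab)
print(speculative_step(fake, fake, torch.randint(0, vocab, (1, 5))).shape)
```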
Benchmark Performance
Google reports the following scores for the instruction-tuned 26B A4B model:
- MMLU Pro: 82.6%
- AIME 2026 (no tools): 88.3%
- LiveCodeBench v6: 77.1%
- Codeforces Elo: 1718
- GPQA Diamond: 82.3%
- Vision MMMU Pro: 73.8%
- MATH-Vision: 82.4%
Model Capabilities
The model supports text and image input with variable aspect ratios and resolutions. Key capabilities include:
- Native function calling for agentic workflows (see the sketch after this list)
- Configurable reasoning modes with step-by-step thinking
- Document parsing, OCR, and chart comprehension
- Code generation and completion
- Multilingual support for 140+ languages
- Native system prompt support
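For the function calling item, the sketch below uses Transformers' generic tool-use chat-template API, where tool schemas are derived from a Python function's signature and docstring. The checkpoint id is a placeholder, and whether Gemma 4's template renders tools exactly this way is an assumption.

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint name -- check the actual Hugging Face repo id.
tok = AutoTokenizer.from_pretrained("google/gemma-4-26b-a4b-it")

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city.
    """
    ...

messages = [{"role": "user", "content": "What's the weather in Zurich?"}]
# Transformers' standard tool-use API: the chat template renders the
# function schema into the prompt. Whether Gemma 4 uses this exact
# template format is an assumption here.
prompt = tok.apply_chat_template(
    messages, tools=[get_weather],
    add_generation_prompt=True, tokenize=False,
)
print(prompt)
```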
The model uses a hybrid attention mechanism that interleaves local sliding window attention with full global attention, with the final layer always using global attention. Global layers employ unified Keys and Values with Proportional RoPE to optimize memory for long contexts.
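The announcement does not specify the interleaving ratio. As a rough picture, the sketch below assumes a Gemma-3-style pattern of five local layers per global layer, forcing the final layer to global as stated:

```python
# Illustrative layer-type assignment for the 30-layer stack. The 5:1
# local-to-global ratio is an assumption borrowed from Gemma 3; the
# article only states that local and global layers interleave and
# that the final layer is always global.
NUM_LAYERS, LOCAL_PER_GLOBAL, WINDOW = 30, 5, 1024

def layer_kinds(num_layers: int) -> list[str]:
    kinds = [
        "global" if (i + 1) % (LOCAL_PER_GLOBAL + 1) == 0 else "local"
        for i in range(num_layers)
    ]
    kinds[-1] = "global"  # final layer always uses full global attention
    return kinds

kinds = layer_kinds(NUM_LAYERS)
print(kinds.count("local"), "local /", kinds.count("global"), "global")
# Local layers attend within a 1024-token sliding window; global layers
# attend over the full 256K context and, per the article, use unified
# K/V with Proportional RoPE to cut long-context memory.
```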
Availability
The assistant model is available now on Hugging Face under the Apache 2.0 license. It requires the latest version of Transformers and plugs into speculative decoding pipelines, where the assistant generates candidate tokens that the target model verifies; a usage sketch follows.
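Here is a minimal usage sketch with Transformers' built-in assisted generation. The repo ids are placeholders, but the `assistant_model` argument to `generate()` is the library's standard speculative decoding entry point.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ids are placeholders -- use the actual Hugging Face model ids.
TARGET = "google/gemma-4-26b-a4b-it"
DRAFT = "google/gemma-4-26b-a4b-assistant"

tok = AutoTokenizer.from_pretrained(TARGET)
target = AutoModelForCausalLM.from_pretrained(
    TARGET, torch_dtype=torch.bfloat16, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(
    DRAFT, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tok("Explain speculative decoding briefly.",
             return_tensors="pt").to(target.device)
# Transformers' built-in assisted generation: pass the drafter via
# `assistant_model` and the library handles drafting + verification.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```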
What This Means
The 2x speedup claim positions this as a significant optimization for production deployments of Gemma 4 26B A4B, particularly for latency-sensitive applications. The MoE architecture's 3.8B active parameter count means the base model already runs substantially faster than Gemma 4's 31B dense variant while maintaining competitive performance on reasoning and coding benchmarks. However, the actual speedup will depend on hardware, batch size, and prompt characteristics; speculative decoding typically performs best on generation tasks with predictable patterns.
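For a back-of-envelope view of where a 2x figure can come from, the standard speculative decoding analysis (Leviathan et al., 2023) gives the expected number of tokens emitted per target forward pass. The acceptance rate and draft length below are assumed values, not figures reported by Google.

```python
# Expected tokens per target verification pass with draft length gamma
# and per-token acceptance rate alpha (Leviathan et al., 2023):
#   E[tokens] = (1 - alpha**(gamma + 1)) / (1 - alpha)
# alpha = 0.8 and gamma = 4 are illustrative assumptions.
alpha, gamma = 0.8, 4
expected = (1 - alpha ** (gamma + 1)) / (1 - alpha)
print(f"{expected:.2f} tokens per target pass")  # ~3.36
# Net wall-clock speedup is lower than this, since each round also pays
# for gamma drafter passes; a ~2x end-to-end gain is plausible when the
# drafter is much cheaper than the 3.8B-active-parameter target.
```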
Related Articles
Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters
Google DeepMind has released Gemma 4, a family of open-weight multimodal models including a 31B dense model with 256K context window and four size variants ranging from 2.3B to 30.7B effective parameters. The release includes Multi-Token Prediction (MTP) draft models that achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.
Mistral Releases Medium 3.5: 128B Dense Model With 256k Context and Configurable Reasoning
Mistral AI released Mistral Medium 3.5, a 128B parameter dense model with a 256k context window that unifies instruction-following, reasoning, and coding capabilities. The model features configurable reasoning effort per request and a vision encoder trained from scratch for variable image sizes.
Poolside releases Laguna XS.2: 33B parameter MoE coding model with 131K context window
Poolside has released Laguna XS.2, a 33B total parameter Mixture-of-Experts model with 3B activated parameters per token, designed for agentic coding. The model features a 131,072-token context window, scores 68.2% on SWE-bench Verified, and is available under Apache 2.0 license with free API access.
IBM releases Apache 2.0 Granite 4.1 LLMs in 3B, 8B, and 30B sizes
IBM has released the Granite 4.1 family of language models under Apache 2.0 license. The models come in 3B, 8B, and 30B parameter sizes. Unsloth has released 21 GGUF quantized variants of the 3B model ranging from 1.2GB to 6.34GB.