Google DeepMind Releases Gemma 4 26B A4B Assistant Model for 2x Faster Inference via Multi-Token Prediction
Google DeepMind has released a Multi-Token Prediction assistant model for Gemma 4 26B A4B that achieves up to 2x decoding speedup through speculative decoding. The underlying base model activates 3.8B of its 25.2B total parameters per token, using a 128-expert MoE architecture with a 256K-token context window.
Gemma 4 26B A4B Assistant — Quick Specs
- Base model: 25.2B total parameters, 3.8B active (MoE, 128 routed experts + 1 shared)
- Context window: 256K tokens
- Claimed speedup: up to 2x via speculative decoding
- License: Apache 2.0, available on Hugging Face
Google DeepMind has released a Multi-Token Prediction (MTP) drafter model for Gemma 4 26B A4B, designed to accelerate inference through speculative decoding. According to Google, the assistant model achieves up to 2x speedup while maintaining identical output quality to standard generation.
Technical Architecture
The Gemma 4 26B A4B base model uses a Mixture-of-Experts architecture with 25.2B total parameters but only 3.8B active parameters during inference. The model features:
- 30 layers interleaving 1024-token sliding-window attention with global attention
- 8 active experts selected per token from 128 routed experts, plus 1 shared expert (see the routing sketch after this list)
- 256K token context window
- 262K vocabulary size
- ~550M parameter vision encoder for multimodal capabilities
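For intuition, here is a minimal sketch of top-k expert routing with the counts reported above (8 of 128 routed experts plus an always-on shared expert). The hidden size, gating scheme, and weight shapes are illustrative assumptions, not Gemma 4's actual implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative MoE router: top-8 of 128 routed experts plus 1 shared
# expert, matching the counts reported for Gemma 4 26B A4B. The hidden
# size and gating details here are assumptions for the sketch.
NUM_EXPERTS, TOP_K, HIDDEN = 128, 8, 2048

def route(hidden_states: torch.Tensor, router_weight: torch.Tensor):
    """Return per-token expert indices and normalized gate weights."""
    logits = hidden_states @ router_weight          # [tokens, 128]
    top_vals, top_idx = logits.topk(TOP_K, dim=-1)  # pick 8 experts/token
    gates = F.softmax(top_vals, dim=-1)             # renormalize over top-8
    return top_idx, gates

tokens = torch.randn(4, HIDDEN)                     # 4 example tokens
router = torch.randn(HIDDEN, NUM_EXPERTS) * 0.02
idx, gates = route(tokens, router)
# Each token's output = shared_expert(x) + sum_k gates[k] * expert[idx[k]](x);
# only 8 of 128 expert MLPs run per token, which is why just 3.8B of the
# 25.2B parameters are active during inference.
print(idx.shape, gates.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```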
The MTP assistant is a smaller, faster draft model that runs ahead of the base model, predicting several tokens at a time. The target model then verifies these predictions in parallel, enabling the speedup without sacrificing quality.
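Conceptually, each speculative decoding round drafts a few tokens cheaply and verifies them all with one target forward pass, keeping the longest agreed prefix. The sketch below is a greedy-verification toy under assumed model interfaces; production implementations use speculative sampling so outputs match the target's distribution exactly.

```python
import torch

def speculative_step(target, drafter, input_ids, k=4):
    """One draft-and-verify round (greedy variant for clarity).

    `target` and `drafter` are stand-ins for model forward passes:
    callables mapping token ids [B, L] to logits [B, L, vocab].
    Real systems use speculative *sampling* so the output distribution
    matches the target model exactly.
    """
    n_prompt = input_ids.shape[1]
    draft = input_ids
    for _ in range(k):  # cheap autoregressive drafting, k small steps
        next_tok = drafter(draft)[:, -1:].argmax(-1)
        draft = torch.cat([draft, next_tok], dim=-1)

    # A single target forward pass scores all k drafted tokens at once.
    tgt_logits = target(draft)
    tgt_pred = tgt_logits[:, n_prompt - 1:-1].argmax(-1)  # [B, k]
    proposed = draft[:, n_prompt:]                        # [B, k]

    # Keep the longest prefix where the target agrees with the drafter,
    # then append the target's own token at the first disagreement.
    # (A full implementation also emits a bonus token when all k match.)
    matches = (tgt_pred == proposed)[0].long()
    n_accept = int(matches.cumprod(0).sum())
    accepted = proposed[:, :n_accept]
    correction = tgt_pred[:, n_accept:n_accept + 1]  # empty if all accepted
    return torch.cat([input_ids, accepted, correction], dim=-1)

# Toy demo with random-logit stand-ins; real use pairs the Gemma 4
# target with the MTP assistant (see Availability below).
vocab = 256
fake = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab)
print(speculative_step(fake, fake, torch.randint(0, vocab, (1, 5))).shape)
```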
Benchmark Performance
Google reports the following scores for the instruction-tuned 26B A4B model:
- MMLU Pro: 82.6%
- AIME 2026 (no tools): 88.3%
- LiveCodeBench v6: 77.1%
- Codeforces Elo: 1718
- GPQA Diamond: 82.3%
- Vision MMMU Pro: 73.8%
- MATH-Vision: 82.4%
Model Capabilities
The model supports text and image input with variable aspect ratios and resolutions. Key capabilities include:
- Native function calling for agentic workflows (see the sketch after this list)
- Configurable reasoning modes with step-by-step thinking
- Document parsing, OCR, and chart comprehension
- Code generation and completion
- Multilingual support for 140+ languages
- Native system prompt support
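For the function calling item, the sketch below uses Transformers' generic tool-use chat-template API, where tool schemas are derived from a Python function's signature and docstring. The checkpoint id is a placeholder, and whether Gemma 4's template renders tools exactly this way is an assumption.

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint name -- check the actual Hugging Face repo id.
tok = AutoTokenizer.from_pretrained("google/gemma-4-26b-a4b-it")

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city.
    """
    ...

messages = [{"role": "user", "content": "What's the weather in Zurich?"}]
# Transformers' standard tool-use API: the chat template renders the
# function schema into the prompt. Whether Gemma 4 uses this exact
# template format is an assumption here.
prompt = tok.apply_chat_template(
    messages, tools=[get_weather],
    add_generation_prompt=True, tokenize=False,
)
print(prompt)
```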
The model uses a hybrid attention mechanism that interleaves local sliding window attention with full global attention, with the final layer always using global attention. Global layers employ unified Keys and Values with Proportional RoPE to optimize memory for long contexts.
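The announcement does not specify the interleaving ratio. As a rough picture, the sketch below assumes a Gemma-3-style pattern of five local layers per global layer, forcing the final layer to global as stated:

```python
# Illustrative layer-type assignment for the 30-layer stack. The 5:1
# local-to-global ratio is an assumption borrowed from Gemma 3; the
# article only states that local and global layers interleave and
# that the final layer is always global.
NUM_LAYERS, LOCAL_PER_GLOBAL, WINDOW = 30, 5, 1024

def layer_kinds(num_layers: int) -> list[str]:
    kinds = [
        "global" if (i + 1) % (LOCAL_PER_GLOBAL + 1) == 0 else "local"
        for i in range(num_layers)
    ]
    kinds[-1] = "global"  # final layer always uses full global attention
    return kinds

kinds = layer_kinds(NUM_LAYERS)
print(kinds.count("local"), "local /", kinds.count("global"), "global")
# Local layers attend within a 1024-token sliding window; global layers
# attend over the full 256K context and, per the article, use unified
# K/V with Proportional RoPE to cut long-context memory.
```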
Availability
The assistant model is available now on Hugging Face under the Apache 2.0 license. It requires the latest version of Transformers and plugs into speculative decoding pipelines, where the assistant generates candidate tokens that the target model verifies; a usage sketch follows.
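Here is a minimal usage sketch with Transformers' built-in assisted generation. The repo ids are placeholders, but the `assistant_model` argument to `generate()` is the library's standard speculative decoding entry point.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ids are placeholders -- use the actual Hugging Face model ids.
TARGET = "google/gemma-4-26b-a4b-it"
DRAFT = "google/gemma-4-26b-a4b-assistant"

tok = AutoTokenizer.from_pretrained(TARGET)
target = AutoModelForCausalLM.from_pretrained(
    TARGET, torch_dtype=torch.bfloat16, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(
    DRAFT, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tok("Explain speculative decoding briefly.",
             return_tensors="pt").to(target.device)
# Transformers' built-in assisted generation: pass the drafter via
# `assistant_model` and the library handles drafting + verification.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```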
What This Means
The 2x speedup claim positions this as a significant optimization for production deployments of Gemma 4 26B A4B, particularly for latency-sensitive applications. The MoE architecture's 3.8B active parameter count means the base model already runs substantially faster than Gemma 4's 31B dense variant while maintaining competitive performance on reasoning and coding benchmarks. However, the actual speedup will depend on hardware, batch size, and prompt characteristics; speculative decoding typically performs best on generation tasks with predictable patterns.
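For a back-of-envelope view of where a 2x figure can come from, the standard speculative decoding analysis (Leviathan et al., 2023) gives the expected number of tokens emitted per target forward pass. The acceptance rate and draft length below are assumed values, not figures reported by Google.

```python
# Expected tokens per target verification pass with draft length gamma
# and per-token acceptance rate alpha (Leviathan et al., 2023):
#   E[tokens] = (1 - alpha**(gamma + 1)) / (1 - alpha)
# alpha = 0.8 and gamma = 4 are illustrative assumptions.
alpha, gamma = 0.8, 4
expected = (1 - alpha ** (gamma + 1)) / (1 - alpha)
print(f"{expected:.2f} tokens per target pass")  # ~3.36
# Net wall-clock speedup is lower than this, since each round also pays
# for gamma drafter passes; a ~2x end-to-end gain is plausible when the
# drafter is much cheaper than the 3.8B-active-parameter target.
```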
Related Articles
Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters
Google DeepMind has released Gemma 4, a family of open-weight multimodal models including a 31B dense model with 256K context window and four size variants ranging from 2.3B to 30.7B effective parameters. The release includes Multi-Token Prediction (MTP) draft models that achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.
Mistral Releases Medium 3.5: 128B Dense Model With 256k Context and Configurable Reasoning
Mistral AI released Mistral Medium 3.5, a 128B parameter dense model with a 256k context window that unifies instruction-following, reasoning, and coding capabilities. The model features configurable reasoning effort per request and a vision encoder trained from scratch for variable image sizes.
Poolside releases Laguna XS.2: 33B parameter MoE coding model with 131K context window
Poolside has released Laguna XS.2, a 33B total parameter Mixture-of-Experts model with 3B activated parameters per token, designed for agentic coding. The model features a 131,072-token context window, scores 68.2% on SWE-bench Verified, and is available under Apache 2.0 license with free API access.
IBM releases Apache 2.0 Granite 4.1 LLMs in 3B, 8B, and 30B sizes
IBM has released the Granite 4.1 family of language models under Apache 2.0 license. The models come in 3B, 8B, and 30B parameter sizes. Unsloth has released 21 GGUF quantized variants of the 3B model ranging from 1.2GB to 6.34GB.