Google DeepMind Releases Gemma 4 26B A4B Assistant Model for 2x Faster Inference via Multi-Token Prediction

TL;DR

Google DeepMind has released a Multi-Token Prediction assistant model for Gemma 4 26B A4B that achieves up to 2x decoding speedup through speculative decoding. The target model is a Mixture-of-Experts (MoE) architecture that activates 3.8B of its 25.2B total parameters per token, with 128 experts and a 256K-token context window.

Google DeepMind has released a Multi-Token Prediction (MTP) drafter model for Gemma 4 26B A4B, designed to accelerate inference through speculative decoding. According to Google, the assistant model achieves up to 2x speedup while maintaining identical output quality to standard generation.

Technical Architecture

The Gemma 4 26B A4B base model uses a Mixture-of-Experts architecture with 25.2B total parameters, of which only 3.8B (about 15%) are active for any given token during inference. The model features:

  • 30 layers with 1024-token sliding window attention
  • 8 active experts selected from 128 total experts plus 1 shared expert
  • 256K token context window
  • 262K vocabulary size
  • ~550M parameter vision encoder for multimodal capabilities
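
The expert selection described in the list above can be illustrated with a minimal top-k routing sketch. Only the expert counts (8 of 128, plus 1 shared expert) come from the article. The function names, the scalar toy experts, and the choice to apply a softmax over just the selected logits are assumptions made for illustration, not Gemma 4's actual implementation.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts (from the article)
TOP_K = 8          # experts activated per token (from the article)


def top_k_route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # ASSUMPTION: weights are a softmax over the selected logits only.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, experts, shared_expert):
    """Add the weighted top-k expert outputs to the always-on shared expert."""
    out = shared_expert(x)
    for idx, weight in top_k_route(router_logits):
        out += weight * experts[idx](x)
    return out


# Toy scalar "experts" (hypothetical): expert i multiplies its input by i + 1.
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]
shared_expert = lambda x: x

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top_k_route(logits)
y = moe_forward(1.0, logits, experts, shared_expert)
print(f"{len(routes)} of {NUM_EXPERTS} experts ran for this token")
```

Because only 8 routed experts plus the shared expert run per token, per-token compute tracks the 3.8B active parameters rather than the 25.2B total.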

The MTP assistant is a smaller, faster draft model that pairs with this base model and predicts several tokens ahead. The target model then verifies these predictions in parallel and keeps only the ones that match what it would have generated itself, which is why the speedup does not come at the cost of quality.
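
The draft-then-verify loop can be sketched with two toy next-token functions. This is a simplified greedy variant written for illustration, not Google's implementation: `target_next`, `draft_next`, and `draft_len` are hypothetical names, and the verification loop stands in for what a real target model would do in a single batched forward pass.

```python
def target_next(tokens):
    """Toy 'target model': the next token is the context sum modulo 10."""
    return sum(tokens) % 10


def draft_next(tokens):
    """Toy 'draft model': agrees with the target unless the last token is 7."""
    return 0 if tokens[-1] == 7 else sum(tokens) % 10


def greedy_generate(prompt, max_new_tokens):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(prompt, max_new_tokens, draft_len=4):
    """Greedy speculative decoding; returns (tokens, target_passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes draft_len tokens autoregressively.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2. The target checks every drafted position. A real model scores all
        #    positions in one forward pass; the loop here only simulates that.
        target_passes += 1
        accepted = []
        for i in range(draft_len):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # target's token replaces the first miss
                break
        else:
            # Every draft token matched, so the same pass yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes


spec_tokens, passes = speculative_generate([1, 2], max_new_tokens=20)
assert spec_tokens == greedy_generate([1, 2], 20)  # identical output
print(f"20 tokens generated with {passes} target passes")
```

Every emitted token is one the target would have chosen anyway, so the output matches plain greedy decoding while the number of target passes drops whenever the draft guesses correctly.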

Benchmark Performance

Google reports the following scores for the instruction-tuned 26B A4B model:

  • MMLU Pro: 82.6%
  • AIME 2026 (no tools): 88.3%
  • LiveCodeBench v6: 77.1%
  • Codeforces Elo: 1718
  • GPQA Diamond: 82.3%
  • Vision MMMU Pro: 73.8%
  • MATH-Vision: 82.4%

Model Capabilities

The model supports text and image input with variable aspect ratios and resolutions. Key capabilities include:

  • Native function calling for agentic workflows
  • Configurable reasoning modes with step-by-step thinking
  • Document parsing, OCR, and chart comprehension
  • Code generation and completion
  • Multilingual support for 140+ languages
  • Native system prompt support

The model uses a hybrid attention mechanism that interleaves local sliding window attention with full global attention, with the final layer always using global attention. Global layers employ unified Keys and Values with Proportional RoPE to optimize memory for long contexts.
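
A small sketch can make the interleaving concrete. The layer count (30) and window size (1024) come from the article. The interleaving period `GLOBAL_EVERY` is a hypothetical value chosen for illustration, since the article does not state the local-to-global ratio. The window convention (a token sees itself and the previous 1023 positions) is likewise an assumption.

```python
NUM_LAYERS = 30        # from the article
SLIDING_WINDOW = 1024  # from the article
GLOBAL_EVERY = 6       # ASSUMPTION: hypothetical period, not stated in the article


def layer_types(num_layers=NUM_LAYERS, global_every=GLOBAL_EVERY):
    """Label each layer 'local' or 'global', forcing the final layer to global."""
    types = ["global" if (i + 1) % global_every == 0 else "local"
             for i in range(num_layers)]
    types[-1] = "global"  # the article says the last layer is always global
    return types


def can_attend(query_pos, key_pos, layer_type, window=SLIDING_WINDOW):
    """Causal attention rule for one query/key pair in a given layer type."""
    if key_pos > query_pos:
        return False  # causal: no attending to future tokens
    if layer_type == "global":
        return True   # full attention over the whole prefix
    # ASSUMPTION: a local layer sees the current token plus the previous window - 1.
    return query_pos - key_pos < window


types = layer_types()
print(types.count("local"), "local layers,", types.count("global"), "global layers")
print(can_attend(5000, 0, "local"), can_attend(5000, 0, "global"))
```

Local layers only need to cache keys and values for the most recent window, so the memory cost of a long context is concentrated in the global layers.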

Availability

The assistant model is available now on Hugging Face under the Apache 2.0 license. It requires the latest version of Transformers and works through speculative decoding pipelines in which the assistant generates candidate tokens that the target model verifies.
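
In Transformers, speculative (assisted) decoding is generally enabled by passing a draft model to `generate()` through the `assistant_model` argument. The sketch below shows that general pattern only. The helper name `generate_with_assistant` is hypothetical, the repository IDs are deliberately left as parameters because the article does not give them, and using `AutoModelForCausalLM` for both checkpoints is an assumption; the model cards may prescribe a different class or extra setup.

```python
def generate_with_assistant(target_id, assistant_id, prompt, max_new_tokens=128):
    """Generate text from the target model, using a draft model for speed.

    target_id and assistant_id are Hugging Face repository IDs supplied by the
    caller; take the real values from the model cards.
    """
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(target_id)
    # ASSUMPTION: both checkpoints load through AutoModelForCausalLM.
    target = AutoModelForCausalLM.from_pretrained(target_id)
    assistant = AutoModelForCausalLM.from_pretrained(assistant_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    # assistant_model switches generate() to assisted (speculative) decoding.
    output_ids = target.generate(
        **inputs,
        assistant_model=assistant,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling the helper downloads both checkpoints, so it is best tried on a machine with enough memory for the 26B target model.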

What This Means

The 2x speedup claim positions this as a significant optimization for production deployments of Gemma 4 26B A4B, particularly for latency-sensitive applications. The MoE architecture's 3.8B active parameter count means it runs substantially faster than the 31B dense model while maintaining competitive performance on reasoning and coding benchmarks. However, the actual speedup will depend on hardware, batch size, and prompt characteristics. Speculative decoding typically performs best on generation tasks with predictable patterns, where the draft model's guesses are accepted more often.
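
A rough model shows why the realized speedup varies so much. The sketch below uses the standard simplified analysis of speculative decoding, which assumes each drafted token is accepted independently with the same probability and that a verification pass costs about as much as a normal target decoding step. The parameter names and example values are illustrative and are not measurements of Gemma 4.

```python
def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens emitted per target verification pass.

    Assumes independent acceptance with probability accept_rate per drafted
    token; the pass always yields at least one token from the target itself.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def estimated_speedup(accept_rate, draft_len, draft_cost_ratio):
    """Estimated speedup over plain decoding.

    draft_cost_ratio is the cost of one draft step relative to one target pass
    (hypothetical parameter; it depends on hardware and batch size).
    """
    cost_per_pass = draft_len * draft_cost_ratio + 1
    return expected_tokens_per_pass(accept_rate, draft_len) / cost_per_pass


# Illustrative values only: compare a low and a high acceptance rate.
for rate in (0.3, 0.6, 0.9):
    print(rate, round(estimated_speedup(rate, draft_len=4, draft_cost_ratio=0.1), 2))
```

Under this model a low acceptance rate can push the estimate below 1, meaning the drafting overhead outweighs the savings, which is consistent with speedups being quoted as "up to" a given factor.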
