Google DeepMind Releases Gemma 4 E4B with Multi-Token Prediction for 2x Faster Inference

TL;DR

Google DeepMind released the Gemma 4 E4B assistant model, which uses a Multi-Token Prediction (MTP) architecture to accelerate inference by up to 2x through speculative decoding. The 4.5B-effective-parameter model supports a 128K-token context window and handles text, image, and audio input; pricing has not yet been disclosed.

Google DeepMind released the Gemma 4 E4B assistant model, a specialized Multi-Token Prediction (MTP) drafter that accelerates inference by up to 2x when used in speculative decoding pipelines. The model is designed for low-latency and on-device applications while producing output identical to standard generation.

Model Specifications

The Gemma 4 E4B features 4.5B effective parameters (8B total with embeddings) across 42 layers. The model employs Per-Layer Embeddings (PLE) architecture, where each decoder layer maintains its own embedding table—resulting in a smaller effective parameter count optimized for on-device deployment.
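The release notes don't spell out how PLE is wired in, but the idea is straightforward: instead of relying only on one shared input embedding, each layer looks up its own small token embedding and mixes it into the residual stream, and those per-layer tables can live in cheaper memory than the accelerator weights, which is presumably why the effective count (4.5B) is quoted below the 8B total. A minimal sketch, assuming an additive mixing rule and illustrative dimensions (the class name, `d_ple`, and the simplified MLP block are not from the release):

```python
import torch
import torch.nn as nn

class PLELayer(nn.Module):
    """One decoder layer with its own Per-Layer Embedding table (sketch)."""

    def __init__(self, vocab_size=262_144, d_model=2048, d_ple=256):
        super().__init__()
        self.ple = nn.Embedding(vocab_size, d_ple)        # this layer's table
        self.proj = nn.Linear(d_ple, d_model, bias=False)
        self.mlp = nn.Sequential(                         # stand-in for the real
            nn.Linear(d_model, 4 * d_model),              # attention + MLP block
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, hidden, input_ids):
        # Mix the layer-local token embedding into the residual stream,
        # then apply the (simplified) transformer block.
        hidden = hidden + self.proj(self.ple(input_ids))
        return hidden + self.mlp(hidden)
```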

Key specifications include:

  • Context window: 128K tokens
  • Sliding window: 512 tokens
  • Vocabulary size: 262K tokens
  • Modalities: Text, image, and audio input with text output
  • Vision encoder: ~150M parameters
  • Audio encoder: ~300M parameters
  • License: Apache 2.0

How Multi-Token Prediction Works

MTP extends the base Gemma 4 model with a smaller, faster draft model that predicts multiple tokens ahead. The target model then verifies these predictions in parallel, significantly reducing latency without compromising output quality. According to Google DeepMind, this architecture is "perfect for low-latency and on-device applications."
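Mechanically, this is standard speculative decoding: draft cheaply, verify in one parallel pass, accept the longest agreeing prefix. A minimal greedy-decoding sketch, assuming Hugging-Face-style models whose outputs expose `.logits` (no KV caching, for brevity; the function and variable names are illustrative, not the Gemma 4 API):

```python
import torch

def speculative_step(target_model, draft_model, input_ids, k=4):
    """Draft k tokens with the small model, verify with one target pass."""
    prompt_len = input_ids.shape[1]

    # 1. Draft phase: the drafter proposes k tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2. Verify phase: a single parallel forward pass of the target model
    # scores every drafted position at once.
    target_logits = target_model(draft_ids).logits
    target_preds = target_logits[:, prompt_len - 1 :, :].argmax(-1)  # k+1 preds
    drafted = draft_ids[:, prompt_len:]

    # 3. Accept the longest agreeing prefix, then take the target's own next
    # token, so the result is identical to standard greedy decoding.
    agree = (target_preds[:, :k] == drafted).long()[0]
    n_accept = int(agree.cumprod(0).sum())
    return torch.cat(
        [input_ids, drafted[:, :n_accept], target_preds[:, n_accept : n_accept + 1]],
        dim=-1,
    )
```

When the drafter agrees with the target most of the time, each target forward pass yields several tokens instead of one, which is how an up-to-2x speedup can arise without changing the output.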

Benchmark Performance

Google DeepMind reports the following scores for the instruction-tuned E4B model:

  • MMLU Pro: 69.4%
  • AIME 2026 (no tools): 42.5%
  • LiveCodeBench v6: 52.0%
  • Codeforces Elo: 940
  • GPQA Diamond: 58.6%
  • Vision MMMU Pro: 52.6%
  • CoVoST (audio): 35.54

These scores place the E4B below the larger 31B dense model (MMLU Pro: 85.2%) and 26B MoE model (MMLU Pro: 82.6%), but ahead of the smaller E2B variant.

Technical Architecture

The model uses a hybrid attention mechanism that alternates between local sliding window attention (512 tokens) and full global attention. The final layer always employs global attention. Global layers use unified Keys and Values with Proportional RoPE (p-RoPE) to optimize memory for long-context processing.
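The article doesn't give the exact interleaving ratio, but the pattern is easy to picture. A sketch assuming strict 1:1 alternation (with the final layer forced global, as stated) and showing what the 512-token sliding window means for the attention mask; all names here are illustrative:

```python
import torch

NUM_LAYERS = 42
WINDOW = 512

def is_global(layer_idx: int, num_layers: int = NUM_LAYERS) -> bool:
    """Assumed 1:1 local/global alternation; the final layer is always global."""
    return layer_idx == num_layers - 1 or layer_idx % 2 == 1

def local_attention_mask(seq_len: int, window: int = WINDOW) -> torch.Tensor:
    """Causal mask where position i sees at most the last `window` tokens
    (itself included)."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def global_attention_mask(seq_len: int) -> torch.Tensor:
    """Plain causal mask: every position sees the full prefix."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return j <= i
```

The practical payoff is that local layers keep only a 512-token KV window, so the KV-cache cost of a 128K-token context is dominated by the comparatively few global layers.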

Multimodal Capabilities

The E4B handles images of variable aspect ratio and resolution, video frame sequences, and audio input. Capabilities include document parsing, OCR across multiple languages, handwriting recognition, automatic speech recognition, and speech translation (speech in, translated text out).

The model supports native function calling for agentic workflows and includes a configurable "thinking mode" for step-by-step reasoning. It maintains multilingual support for 35+ languages out of the box, with pretraining on 140+ languages.
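For function calling, the natural entry point is the `tools=` hook of Transformers chat templates, which converts annotated Python functions into a tool schema the model is prompted with. A sketch, assuming the released checkpoint ships a tool-aware chat template; the repo id is a placeholder, not a confirmed name:

```python
from transformers import AutoTokenizer

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny"  # stub; a real tool would call an API

# Placeholder repo id -- check the Hugging Face model card for the real one.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-e4b-it")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What's the weather in Zurich?"}],
    tools=[get_weather],          # rendered as a JSON schema in the prompt
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)
```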

Deployment

The model is available on Hugging Face and requires the latest version of Transformers. Implementation requires loading both the target Gemma 4 E4B model and the assistant drafter model to enable the speculative decoding pipeline.
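Transformers exposes speculative decoding through the `assistant_model` argument of `generate()`, so a deployment sketch looks like the following; both repo ids are placeholders for the actual target and drafter checkpoints on Hugging Face:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo ids -- substitute the real target and MTP drafter names.
TARGET_ID = "google/gemma-4-e4b-it"
DRAFTER_ID = "google/gemma-4-e4b-it-mtp"

tokenizer = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(TARGET_ID, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(DRAFTER_ID, device_map="auto")

inputs = tokenizer("Explain speculative decoding briefly.", return_tensors="pt").to(target.device)
# assistant_model switches generate() into assisted (speculative) decoding;
# the output matches what the target model would produce on its own.
outputs = target.generate(**inputs, assistant_model=drafter, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```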

Pricing details have not been disclosed.

What This Means

The MTP architecture represents a practical approach to accelerating large language model inference without quality degradation—critical for on-device and edge deployments where latency matters. The 2x speedup claim positions this as a direct competitor to other optimization techniques like quantization or distillation, but with the advantage of maintaining exact output equivalence. The E4B's multimodal support and 128K context window make it viable for real-world applications on consumer hardware, though the lack of disclosed pricing leaves deployment costs uncertain for commercial users.
