Google DeepMind Releases Gemma 4 E4B with Multi-Token Prediction for 2x Faster Inference

TL;DR

Google DeepMind released the Gemma 4 E4B assistant model, which uses a Multi-Token Prediction (MTP) architecture to accelerate inference by up to 2x through speculative decoding. The model has 4.5B effective parameters, supports a 128K-token context window, and accepts text, image, and audio input. Pricing has not been disclosed.

Google DeepMind released the Gemma 4 E4B assistant model, a specialized Multi-Token Prediction (MTP) drafter that accelerates inference by up to 2x when used in speculative decoding pipelines. The model is designed for low-latency and on-device applications while maintaining identical output quality to standard generation.

Model Specifications

Gemma 4 E4B has 4.5B effective parameters (8B total with embeddings) across 42 layers. The model uses a Per-Layer Embeddings (PLE) architecture in which each decoder layer maintains its own embedding table. Because the effective count excludes these embeddings, it is well below the 8B total, which suits on-device deployment.

Key specifications include:

  • Context window: 128K tokens
  • Sliding window: 512 tokens
  • Vocabulary size: 262K tokens
  • Modalities: Text, image, and audio input with text output
  • Vision encoder: ~150M parameters
  • Audio encoder: ~300M parameters
  • License: Apache 2.0

How Multi-Token Prediction Works

MTP extends the base Gemma 4 model with a smaller, faster draft model that predicts multiple tokens ahead. The target model then verifies these predictions in parallel, significantly reducing latency without compromising output quality. According to Google DeepMind, this architecture is "perfect for low-latency and on-device applications."
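The draft-then-verify loop described above can be illustrated with a minimal sketch. It uses greedy speculative decoding with two toy stand-in "models" (`target_next` and `draft_next` are hypothetical functions invented for illustration, not Gemma 4 or its drafter), and the draft length `k` is an arbitrary choice. It is not Google DeepMind's implementation; it only shows why accepted draft tokens leave the output identical to standard generation.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the target model: deterministic next token."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context: List[Token]) -> Token:
    """Toy stand-in for the drafter: wrong whenever len(context) % 4 == 0."""
    guess = target_next(context)
    return (guess + 1) % 50 if len(context) % 4 == 0 else guess


def greedy_decode(next_token: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Standard generation: one model call per new token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1) The drafter proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target scores all k + 1 positions. A real model would do this
        #    in a single batched forward pass, so it counts as one pass here.
        passes += 1
        verified = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3) Keep the longest prefix on which drafter and target agree.
        accepted = 0
        while accepted < k and proposal[accepted] == verified[accepted]:
            accepted += 1
        out.extend(proposal[:accepted])
        # 4) Add the target's own token at the first mismatch (or a bonus
        #    token if every draft token was accepted).
        out.append(verified[accepted])
    return out[:goal], passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_next, prompt, 20)
fast, passes = speculative_decode(target_next, draft_next, prompt, 20)
print(fast == baseline, passes)
```

Every token kept in step 3 is one the target would have produced anyway, so the result matches plain greedy decoding while the target runs fewer passes than there are new tokens. Sampling-based variants use a probabilistic acceptance rule instead of exact matching, which this sketch does not cover.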

Benchmark Performance

Google DeepMind reports the following scores for the instruction-tuned E4B model:

  • MMLU Pro: 69.4%
  • AIME 2026 (no tools): 42.5%
  • LiveCodeBench v6: 52.0%
  • Codeforces ELO: 940
  • GPQA Diamond: 58.6%
  • Vision MMMU Pro: 52.6%
  • CoVoST (audio): 35.54

These scores place the E4B below the larger 31B dense model (MMLU Pro: 85.2%) and 26B MoE model (MMLU Pro: 82.6%), but ahead of the smaller E2B variant.

Technical Architecture

The model uses a hybrid attention mechanism that alternates between local sliding window attention (512 tokens) and full global attention. The final layer always employs global attention. Global layers use unified Keys and Values with Proportional RoPE (p-RoPE) to optimize memory for long-context processing.
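As a rough illustration of the hybrid scheme, the sketch below labels layers as local or global and applies the corresponding causal visibility rule. The 512-token window and the 42-layer count come from the specifications above. The spacing of global layers (`GLOBAL_EVERY`) is a hypothetical value chosen for illustration, since the article does not state the interleaving ratio. Whether the window includes the current token is also an assumption.

```python
from typing import List

SLIDING_WINDOW = 512  # from the specification list
NUM_LAYERS = 42       # from the specification list
GLOBAL_EVERY = 6      # ASSUMPTION: hypothetical spacing, not stated in the article


def layer_types(num_layers: int = NUM_LAYERS, global_every: int = GLOBAL_EVERY) -> List[str]:
    """Label each layer 'local' or 'global'; the final layer is always global."""
    types = [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(num_layers)
    ]
    types[-1] = "global"
    return types


def can_attend(query_pos: int, key_pos: int, layer_type: str,
               window: int = SLIDING_WINDOW) -> bool:
    """Causal visibility rule for one query/key pair (0-indexed positions)."""
    if key_pos > query_pos:
        return False  # never attend to future tokens
    if layer_type == "global":
        return True   # global layers see the whole prefix
    # Local layers see only the most recent `window` tokens,
    # counting the current token (an assumed convention).
    return query_pos - key_pos < window


def visible_keys(query_pos: int, layer_type: str, window: int = SLIDING_WINDOW) -> int:
    """Number of positions a query at `query_pos` can attend to."""
    if layer_type == "global":
        return query_pos + 1
    return min(query_pos + 1, window)


last_pos = 128 * 1024 - 1  # last position of a 128K context, assuming 128K = 131072
print(layer_types().count("global"), "global layers under the assumed spacing")
print(visible_keys(last_pos, "local"), "keys visible to a local layer")
print(visible_keys(last_pos, "global"), "keys visible to a global layer")
```

At the end of a full context, a local layer attends to 512 positions while a global layer attends to all of them, which is why the global layers dominate long-context memory cost.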

Multimodal Capabilities

The E4B handles images of variable aspect ratio and resolution, video frame sequences, and audio input. Capabilities include document parsing, OCR across multiple languages, handwriting recognition, automatic speech recognition, and translation of speech into text in another language.

The model supports native function calling for agentic workflows and includes a configurable "thinking mode" for step-by-step reasoning. It maintains multilingual support for 35+ languages out of the box, with pretraining on 140+ languages.

Deployment

The model is available on Hugging Face and requires the latest version of the Transformers library. Running it with speculative decoding requires loading both the target Gemma 4 E4B model and the assistant drafter model.

Pricing details have not been disclosed.

What This Means

The MTP architecture is a practical way to accelerate large language model inference without quality degradation, which is critical for on-device and edge deployments where latency matters. The 2x speedup claim positions MTP as an alternative to other optimization techniques such as quantization or distillation, with the advantage that the output is identical to standard generation. The E4B's multimodal support and 128K context window make it viable for real-world applications on consumer hardware, though the lack of disclosed pricing leaves deployment costs uncertain for commercial users.
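To see how a figure like "up to 2x" can arise, here is a back-of-the-envelope estimate using a commonly used simplified model of speculative decoding. It assumes each draft token is accepted independently with probability `alpha`, that the drafter proposes `k` tokens per step, and that one drafter call costs `draft_cost_ratio` times one target call. All three values below are hypothetical and were not reported by Google DeepMind.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target verification pass.

    Assumes independent per-token acceptance with probability `alpha`;
    the sum 1 + alpha + ... + alpha**k includes the target's own token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Estimated wall-clock speedup over one-token-at-a-time decoding.

    Each step costs k drafter calls plus one target pass, measured in
    units of one target pass.
    """
    step_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, k) / step_cost


# Hypothetical settings, for illustration only.
for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost_ratio=0.1), 2))
```

Under this model the gain depends heavily on how often the drafter agrees with the target: a drafter that is rarely accepted can make decoding slower than the baseline, because its calls are pure overhead.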
