Google DeepMind releases DiffusionGemma, a 26B parameter model generating 15-20 tokens per forward pass via discrete dif

TL;DR

Google DeepMind released DiffusionGemma, a 26B-parameter mixture-of-experts model that generates text using discrete diffusion instead of autoregression. The model processes blocks of 256 tokens in parallel and exceeds 1,100 tokens per second on H100 GPUs in low-batch settings.

June 10, 2026 · 6:06 PM · 3 min read

Google DeepMind Releases DiffusionGemma: 26B Parameter Model Uses Discrete Diffusion for Faster Text Generation

Google DeepMind released DiffusionGemma, a 26B-parameter multimodal model that generates text using discrete diffusion rather than traditional token-by-token autoregression. The model denoises blocks of 256 tokens in parallel over several iterations, producing 15-20 tokens per forward pass and exceeding 1,100 tokens per second on H100 GPUs at FP8 precision in low-batch scenarios.
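To put those two figures side by side, the short calculation below is a rough sketch. It assumes the 15-20 tokens per forward pass and the 1,100 tokens per second describe the same deployment, which the announcement does not state explicitly, and the function names are illustrative only.

```python
# Back-of-the-envelope arithmetic from the figures quoted above.
# Assumption: tokens-per-pass and tokens-per-second refer to the same setup.

def forward_passes_per_second(tokens_per_second, tokens_per_pass):
    """Decoder forward passes per second implied by the two quoted rates."""
    return tokens_per_second / tokens_per_pass


def passes_per_canvas(canvas_tokens, tokens_per_pass):
    """Average forward passes needed to fill one canvas."""
    return canvas_tokens / tokens_per_pass


low = forward_passes_per_second(1100, 20)   # fewer passes when each yields 20 tokens
high = forward_passes_per_second(1100, 15)  # more passes when each yields 15 tokens
fast_canvas = passes_per_canvas(256, 20)
slow_canvas = passes_per_canvas(256, 15)

print(f"{low:.0f}-{high:.0f} forward passes per second")
print(f"{fast_canvas:.1f}-{slow_canvas:.1f} forward passes per 256-token canvas")
```

Under that assumption, a 256-token canvas takes on the order of 13 to 17 forward passes.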

Architecture and Technical Specifications

DiffusionGemma employs an encoder-decoder architecture built on the Gemma 4 26B A4B mixture-of-experts foundation. The model activates 8 experts out of 128 total, plus 1 shared expert, resulting in 3.8B active parameters from 25.2B total parameters. It supports context windows up to 256K tokens with a sliding window of 1024 tokens.
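As a rough illustration of what "8 experts out of 128" means in practice, here is a minimal top-k routing sketch in plain Python. The article gives only the expert counts and parameter totals; the softmax over the selected experts, the random router scores, and the `route` helper are assumptions made for illustration, not the model's documented router.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts, from the article
TOP_K = 8          # experts activated, from the article


def route(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and normalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    The softmax-over-selected normalization is an assumption.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # hypothetical scores
chosen = route(logits)

# Share of parameters active at once, using the totals quoted above.
active_fraction = 3.8 / 25.2
print(len(chosen), "experts selected;", f"{active_fraction:.1%} of parameters active")
```

The active share works out to roughly 15% of the total parameter count.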

The encoder operates as a prefill mechanism, processing the prompt and generating the KV cache autoregressively. The decoder then uses bidirectional attention over a 256-token "canvas," accessing the cached context via cross-attention. During multi-canvas sampling, the model iteratively denoises a complete token block using a diffusion sampler. Once a canvas is fully denoised, the encoder processes it and appends it to the KV cache before the next canvas is generated.
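The control flow described above can be sketched with stand-in functions. Everything here is a toy: `encode` and `toy_denoise_step` are hypothetical placeholders for the real encoder and decoder, the "KV cache" is just a list of token ids, and modelling noise as masked positions is an assumption the article does not confirm.

```python
import random

MASK = None        # placeholder for a not-yet-denoised position (assumption)
CANVAS_SIZE = 256  # canvas length, from the article
MAX_STEPS = 48     # maximum denoising steps, from the recommended configuration


def encode(tokens, kv_cache):
    """Stand-in for the encoder: append processed tokens to the cache."""
    kv_cache.extend(tokens)


def toy_denoise_step(canvas, kv_cache, rng, per_step=16):
    """Stand-in for one decoder pass: fill up to per_step masked positions.

    A real decoder would attend to kv_cache; this toy ignores it.
    """
    masked = [i for i, tok in enumerate(canvas) if tok is MASK]
    for i in rng.sample(masked, min(per_step, len(masked))):
        canvas[i] = rng.randrange(1000)  # placeholder token id
    return canvas


def generate(prompt, num_canvases, canvas_size=CANVAS_SIZE, seed=0):
    rng = random.Random(seed)
    kv_cache = []
    encode(prompt, kv_cache)               # prefill with the prompt
    output = []
    for _ in range(num_canvases):
        canvas = [MASK] * canvas_size      # start from a fully noised canvas
        for _ in range(MAX_STEPS):
            canvas = toy_denoise_step(canvas, kv_cache, rng)
            if MASK not in canvas:         # canvas fully denoised
                break
        encode(canvas, kv_cache)           # append finished canvas to the cache
        output.extend(canvas)
    return output, kv_cache


tokens, cache = generate(prompt=[1, 2, 3], num_canvases=2)
print(len(tokens), len(cache))
```

The point of the sketch is the loop structure: each canvas is denoised in parallel, and only finished canvases are added to the context used by later ones.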

The vision encoder contains approximately 550M parameters and processes images at variable aspect ratios and resolutions, as well as video sequences.

Benchmark Performance

According to Google DeepMind, DiffusionGemma scored 77.6% on MMLU Pro, 69.1% on AIME 2026 (no tools), and achieved a Codeforces ELO of 1429. On vision tasks, the model scored 54.3% on Vision MMMU Pro and 70.5% on MATH-Vision. These scores trail the standard autoregressive Gemma 4 26B A4B model across all benchmarks tested; for example, that variant scored 82.6% on MMLU Pro and 88.3% on AIME 2026, with a Codeforces ELO of 1718.

On the long-context MRCR v2 evaluation (8 needles, 128K tokens), DiffusionGemma averaged 32.0% compared to 44.1% for Gemma 4.
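For readers who want the gaps as numbers, the snippet below subtracts the reported scores for the benchmarks where both models' figures are quoted in this article. It uses only those figures; the dictionary layout is just a convenience.

```python
# (DiffusionGemma, autoregressive Gemma 4 26B A4B) as quoted in the article.
reported = {
    "MMLU Pro (%)": (77.6, 82.6),
    "AIME 2026, no tools (%)": (69.1, 88.3),
    "Codeforces ELO": (1429, 1718),
    "MRCR v2, 8 needles, 128K (%)": (32.0, 44.1),
}

# Gap = autoregressive score minus diffusion score, rounded to one decimal.
gaps = {name: round(ar - diffusion, 1)
        for name, (diffusion, ar) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: autoregressive variant ahead by {gap}")
```

The largest gap in percentage points among these is on AIME 2026, and the smallest is on MMLU Pro.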

Recommended Sampling Configuration

Google DeepMind specifies using diffusion sampling with Entropy-Bounded Denoising and Adaptive Stopping for optimal performance. The configuration includes a maximum of 48 denoising steps, linear temperature decay from 0.8 to 0.4, and an entropy bound of 0.1 for token selection. Adaptive stopping occurs when average model entropy drops below 0.005 and token predictions stabilize across consecutive steps.
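The pieces of this configuration can be sketched as small helper functions. This is an interpretation, not the reference sampler: the article gives the numbers but not the formulas, so the endpoint-inclusive linear schedule, entropy measured in nats, the per-position reading of the entropy bound, and all function names are assumptions.

```python
import math

MAX_STEPS = 48             # from the article
T_START, T_END = 0.8, 0.4  # from the article
ENTROPY_BOUND = 0.1        # from the article
STOP_ENTROPY = 0.005       # from the article


def temperature(step, max_steps=MAX_STEPS):
    """Linear decay from T_START at step 0 to T_END at the last step."""
    frac = step / (max_steps - 1)
    return T_START + (T_END - T_START) * frac


def entropy(probs):
    """Shannon entropy in nats of one position's token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def positions_to_commit(dists, bound=ENTROPY_BOUND):
    """Assumed reading of the bound: commit positions whose entropy is within it."""
    return [i for i, probs in enumerate(dists) if entropy(probs) <= bound]


def should_stop(dists, prev_argmax, curr_argmax, threshold=STOP_ENTROPY):
    """Stop when mean entropy is below the threshold and predictions are unchanged."""
    mean_entropy = sum(entropy(p) for p in dists) / len(dists)
    return mean_entropy < threshold and prev_argmax == curr_argmax


# Tiny two-position example: one confident distribution, one uncertain.
dists = [[1.0, 0.0], [0.5, 0.5]]
print(positions_to_commit(dists))
print(should_stop(dists, prev_argmax=[0, 0], curr_argmax=[0, 0]))
```

In the example, only the confident position would be committed, and sampling would continue because the uncertain position keeps the mean entropy above the stopping threshold.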

Capabilities and Availability

The model accepts text, image, and video inputs and generates text output. Capabilities include document parsing, OCR across multiple languages, handwriting recognition, video analysis, function calling, and a native reasoning mode enabled via a <|think|> control token. DiffusionGemma supports 35+ languages out of the box and was pre-trained on 140+ languages.

The model is available under the Apache 2.0 license on Hugging Face and requires the latest version of the Transformers library. It uses a 262K-token vocabulary.

What This Means

DiffusionGemma is a practical exploration of discrete diffusion for language generation, trading benchmark performance for inference speed in specific deployment scenarios. Generating 15-20 tokens per forward pass is a meaningful architectural shift from standard autoregressive decoding, though the model's lower scores across reasoning, coding, and vision benchmarks show that the speed comes at a cost in accuracy. The approach may prove valuable for applications where generation speed outweighs task accuracy, particularly in single-user or low-batch environments with capable accelerators. However, the benchmark gaps suggest discrete diffusion models require further development to match autoregressive performance on complex reasoning tasks.

Source: huggingface.co


