Google DeepMind Releases DiffusionGemma: 26B Parameter Model Uses Discrete Diffusion for Faster Text Generation
Google DeepMind has released DiffusionGemma, a 26B parameter multimodal mixture-of-experts model that generates text with discrete diffusion rather than traditional token-by-token autoregression. The model processes blocks of 256 tokens in parallel through iterative denoising, producing 15-20 tokens per forward pass and reaching reported speeds above 1100 tokens per second on H100 GPUs at FP8 precision in low-batch scenarios.
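As a rough consistency check on those figures, the short Python sketch below converts the quoted throughput into forward passes per second and estimates how many passes a 256-token block would take. It assumes the 15-20 tokens-per-pass range is an average over a whole block, which the announcement does not state; the function names are illustrative only.

```python
# Back-of-envelope arithmetic using only the figures quoted in the article.
TOKENS_PER_SECOND = 1100      # reported lower bound (H100, FP8, low batch)
TOKENS_PER_PASS = (15, 20)    # reported range per forward pass
CANVAS_TOKENS = 256           # tokens per parallel block


def passes_per_second(tokens_per_second: float, tokens_per_pass: float) -> float:
    """Forward passes per second implied by a throughput figure."""
    return tokens_per_second / tokens_per_pass


def passes_per_canvas(canvas_tokens: int, tokens_per_pass: float) -> float:
    """Average forward passes needed to fill one block (assumes a steady rate)."""
    return canvas_tokens / tokens_per_pass


for tpp in TOKENS_PER_PASS:
    print(
        f"{tpp} tokens/pass: "
        f"{passes_per_second(TOKENS_PER_SECOND, tpp):.1f} passes/s, "
        f"{passes_per_canvas(CANVAS_TOKENS, tpp):.1f} passes per 256-token block"
    )
```

Under that assumption, the quoted numbers imply roughly 55 to 73 forward passes per second and about 13 to 17 passes per block.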
Architecture and Technical Specifications
DiffusionGemma uses an encoder-decoder architecture built on the Gemma 4 26B A4B mixture-of-experts foundation. The model activates 8 of its 128 experts, plus 1 shared expert, for 3.8B active parameters out of 25.2B total. It supports context windows of up to 256K tokens, with a sliding window of 1024 tokens.
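To illustrate what "8 experts out of 128, plus 1 shared expert" means in practice, here is a minimal sketch of generic top-k expert routing in plain Python. The announcement does not describe DiffusionGemma's gating function, so the softmax over the selected logits, the scalar toy experts, and the names route_token and moe_output are all assumptions made for illustration.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts (from the article)
TOP_K = 8          # routed experts activated (from the article)


def route_token(router_logits, top_k=TOP_K):
    """Return {expert_index: weight} for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only. This gating rule
    is an assumption; the article does not specify it.
    """
    if len(router_logits) < top_k:
        raise ValueError("need at least top_k router logits")
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_output(x, router_logits, experts, shared_expert):
    """Always-on shared expert plus the weighted sum of the routed experts."""
    weights = route_token(router_logits)
    routed = sum(w * experts[i](x) for i, w in weights.items())
    return shared_expert(x) + routed


# Toy demo with scalar "experts" (hypothetical stand-ins for feed-forward blocks).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
experts = [lambda x, s=i: x * (1 + s / NUM_EXPERTS) for i in range(NUM_EXPERTS)]
y = moe_output(1.0, logits, experts, shared_expert=lambda x: x)
print(f"experts used: {len(route_token(logits))} routed + 1 shared, output = {y:.3f}")
print(f"active parameter share: {3.8 / 25.2:.1%}")
```

The last line restates the article's parameter counts as a ratio: about 15% of the model's weights are active for a given input.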
The encoder handles prefill: it processes the prompt and builds the KV cache autoregressively. The decoder then applies bidirectional attention over a 256-token "canvas" and reads the cached context through cross-attention. During multi-canvas sampling, a diffusion sampler iteratively denoises each complete token block. Once a canvas is fully denoised, the encoder processes it and appends it to the KV cache before the next canvas is generated.
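The control flow described above can be sketched as a simple loop. This is a schematic under stated assumptions, not the released implementation: encode and denoise_canvas are hypothetical callables standing in for the encoder prefill and the diffusion decoder, the KV cache is modeled as a plain list, and the end-of-sequence handling is a guess.

```python
CANVAS_SIZE = 256  # tokens per canvas (from the article)


def generate(prompt_tokens, encode, denoise_canvas, max_new_tokens, eos_id):
    """Multi-canvas generation loop.

    encode(tokens, cache)        appends the tokens' entries to the cache
    denoise_canvas(cache, size)  returns one fully denoised block of tokens
    Both are hypothetical stand-ins for the real model components.
    """
    cache = []
    encode(prompt_tokens, cache)            # prefill: prompt goes into the cache
    output = []
    while len(output) < max_new_tokens:
        canvas = denoise_canvas(cache, CANVAS_SIZE)
        if eos_id in canvas:                # assumed stop rule: cut at first EOS
            output.extend(canvas[:canvas.index(eos_id)])
            break
        output.extend(canvas)
        encode(canvas, cache)               # finished canvas becomes context
    return output[:max_new_tokens]


def toy_encode(tokens, cache):
    """Toy encoder: the 'cache' just stores the tokens themselves."""
    cache.extend(tokens)


def make_toy_denoiser(total_tokens, prompt_len, eos_id):
    """Toy decoder that emits token 1 until total_tokens, then EOS."""
    def denoise(cache, size):
        done = len(cache) - prompt_len      # tokens generated so far
        return [eos_id if done + i >= total_tokens else 1 for i in range(size)]
    return denoise


prompt = [1, 2, 3]
denoiser = make_toy_denoiser(total_tokens=300, prompt_len=len(prompt), eos_id=0)
tokens = generate(prompt, toy_encode, denoiser, max_new_tokens=1000, eos_id=0)
print(len(tokens), "tokens generated across two canvases")
```

The point of the sketch is the ordering: each block is produced in parallel by the decoder, but blocks are still chained sequentially through the cache.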
The vision encoder contains approximately 550M parameters and processes images at variable aspect ratios and resolutions, as well as video sequences.
Benchmark Performance
According to Google DeepMind, DiffusionGemma scored 77.6% on MMLU Pro and 69.1% on AIME 2026 (no tools), and achieved a Codeforces Elo of 1429. On vision tasks, the model scored 54.3% on Vision MMMU Pro and 70.5% on MATH-Vision. These scores trail the standard autoregressive Gemma 4 26B A4B model across all benchmarks tested, including MMLU Pro (82.6%), AIME 2026 (88.3%), and Codeforces Elo (1718).
On the long-context MRCR v2 8-needle evaluation at 128K tokens, DiffusionGemma averaged 32.0% compared to Gemma 4's 44.1%.
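For readers who want the gaps side by side, the snippet below tabulates the absolute differences using only the scores quoted above. The vision benchmarks are omitted because the article gives no Gemma 4 baseline for them; the variable and function names are illustrative.

```python
# (benchmark, DiffusionGemma, Gemma 4 26B A4B) as quoted in the article.
SCORES = [
    ("MMLU Pro (%)", 77.6, 82.6),
    ("AIME 2026, no tools (%)", 69.1, 88.3),
    ("Codeforces Elo", 1429, 1718),
    ("MRCR v2 8-needle at 128K (%)", 32.0, 44.1),
]


def gaps(scores):
    """Absolute difference (diffusion minus autoregressive) per benchmark."""
    return [(name, round(diff - auto, 1)) for name, diff, auto in scores]


for name, delta in gaps(SCORES):
    print(f"{name:<30} {delta:+}")
```

The differences come to 5.0 points on MMLU Pro, 19.2 points on AIME 2026, 289 Elo on Codeforces, and 12.1 points on MRCR v2.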
Recommended Sampling Configuration
For best results, Google DeepMind recommends diffusion sampling with Entropy-Bounded Denoising and Adaptive Stopping. The configuration uses at most 48 denoising steps, a temperature that decays linearly from 0.8 to 0.4, and an entropy bound of 0.1 for token selection. Adaptive stopping triggers when the average model entropy drops below 0.005 and token predictions remain stable across consecutive steps.
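The pieces of that configuration can be sketched in a few lines of Python. This is one plausible reading, not the published sampler: it assumes entropy is measured in nats, that the 0.1 bound is applied per position, that the temperature reaches 0.4 on the final step, and that "stable" means unchanged top predictions between two consecutive steps. The fallback to the single most confident position is also an assumption, added so every step commits at least one token.

```python
import math

MAX_STEPS = 48                   # maximum denoising steps (from the article)
TEMP_START, TEMP_END = 0.8, 0.4  # linear temperature decay (from the article)
ENTROPY_BOUND = 0.1              # token-selection bound (from the article)
STOP_ENTROPY = 0.005             # adaptive-stopping threshold (from the article)


def temperature(step, max_steps=MAX_STEPS):
    """Linear decay from TEMP_START at step 0 to TEMP_END at the last step."""
    if max_steps <= 1:
        return TEMP_END
    frac = step / (max_steps - 1)
    return TEMP_START + frac * (TEMP_END - TEMP_START)


def entropy(probs):
    """Shannon entropy in nats of one predicted token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_positions(dists, bound=ENTROPY_BOUND):
    """Positions confident enough to commit this step (assumed per-position rule)."""
    ents = [entropy(d) for d in dists]
    chosen = [i for i, h in enumerate(ents) if h <= bound]
    # Assumed fallback: commit the most confident position to guarantee progress.
    return chosen or [min(range(len(ents)), key=ents.__getitem__)]


def should_stop(dists, prev_top, curr_top, threshold=STOP_ENTROPY):
    """Stop when mean entropy is below threshold and top predictions are unchanged."""
    mean_h = sum(entropy(d) for d in dists) / len(dists)
    return mean_h < threshold and prev_top == curr_top


for step in (0, 24, 47):
    print(f"step {step:2d}: temperature {temperature(step):.3f}")
```

In a full sampler these helpers would sit inside a loop of at most 48 iterations, with the model re-predicting the still-undecided positions at each step.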
Capabilities and Availability
The model accepts text, image, and video inputs and generates text output. Capabilities include document parsing, OCR across multiple languages, handwriting recognition, video analysis, function calling, and a native reasoning mode enabled via a <|think|> control token. DiffusionGemma supports 35+ languages out of the box and was pre-trained on 140+ languages.
The model is available under the Apache 2.0 license on Hugging Face and requires the latest version of the Transformers library. It uses a 262K token vocabulary.
What This Means
DiffusionGemma is a practical exploration of discrete diffusion for language generation, trading benchmark performance for inference speed in specific deployment scenarios. Generating 15-20 tokens per forward pass is a meaningful architectural shift from standard autoregressive decoding, but the model's lower scores on reasoning, coding, and vision benchmarks show that the speed comes at a cost in accuracy. The approach may prove valuable for applications where generation speed outweighs task accuracy, particularly in single-user or low-batch environments with capable accelerators. However, the benchmark gaps suggest discrete diffusion models require further development to match autoregressive performance on complex reasoning tasks.