Google releases DiffusionGemma 26B, open-weight model generates 500+ tokens/second
Google has released DiffusionGemma 26B, an open-weight text generation model under the Apache 2.0 license. In testing on NVIDIA's free NIM API, the model produced 2,409 tokens in 4.4 seconds, a rate of more than 500 tokens per second.
DiffusionGemma 26B builds on Gemini Diffusion, the experimental architecture that Google briefly offered in preview in May 2025 before withdrawing it.
Performance metrics
The model demonstrates generation speeds exceeding 500 tokens per second. In testing on NVIDIA's NIM cloud API, DiffusionGemma 26B generated 2,409 tokens in 4.4 seconds when creating an image description, or roughly 548 tokens per second. This is a single measurement on a hosted endpoint rather than a controlled benchmark, but it points to a substantial speedup over standard autoregressive decoding.
Google's earlier Gemini Diffusion preview in May 2025 reportedly reached 857 tokens per second, which suggests that high-speed generation is a consistent property of the approach rather than a one-off result.
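The throughput figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in this article; the helper name `tokens_per_second` is illustrative, and the comparison with the Gemini Diffusion preview is rough at best, since the two figures come from different models measured under different conditions.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation rate for a single request."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# Figures reported in the article.
nim_rate = tokens_per_second(2409, 4.4)   # DiffusionGemma 26B on NVIDIA NIM
gemini_preview_rate = 857.0               # reported for the May 2025 preview

print(f"NIM test: {nim_rate:.1f} tokens/second")                      # 547.5
print(f"Share of preview rate: {nim_rate / gemini_preview_rate:.0%}")  # 64%
```

Note that 4.4 seconds is wall-clock time for one hosted request, so it presumably includes network and queueing overhead; the model's raw decoding rate may be higher.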
Availability and access
The model is available as google/diffusiongemma-26B-A4B-it on Hugging Face. NVIDIA is currently hosting the model free of charge on its NIM cloud API platform, so it can be tried immediately without a local deployment.
The 26B parameter model uses a diffusion-based approach to text generation rather than traditional autoregressive decoding, which enables parallel token generation and faster inference speeds.
Technical details
DiffusionGemma represents a departure from standard autoregressive language models, which generate tokens one at a time, each conditioned on the tokens before it. Instead, the diffusion approach refines many token positions simultaneously over a small number of steps, similar to image diffusion models but adapted for discrete text.
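To make the contrast concrete, here is a toy sketch of the two decoding schedules. It is not DiffusionGemma's algorithm, which Google has not fully documented: the "denoiser" below is a hypothetical stand-in that simply reads the target text and attaches random confidence scores, and all function names are made up for illustration. The only point is the number of model calls: one per token for autoregressive decoding, versus a fixed number of refinement steps for diffusion-style decoding.

```python
import random

MASK = "_"  # placeholder for a position that has not been decided yet


def toy_denoiser(seq, target, rng):
    """Hypothetical stand-in for a model forward pass.

    For every masked position it proposes a token and a confidence score.
    A real model predicts tokens; this toy cheats by copying the target.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(seq) if tok == MASK}


def diffusion_decode(target, steps, seed=0):
    """Start fully masked and commit the most confident proposals each pass."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    passes = 0
    while MASK in seq:
        proposals = toy_denoiser(seq, target, rng)  # one pass, all positions
        passes += 1
        remaining_steps = max(steps - passes + 1, 1)
        k = -(-len(proposals) // remaining_steps)   # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            seq[i] = proposals[i][0]
    return seq, passes


def autoregressive_decode(target):
    """Emit one token per pass, left to right."""
    seq, passes = [], 0
    for tok in target:
        seq.append(tok)
        passes += 1
    return seq, passes


words = "the quick brown fox jumps over the lazy dog today".split()
diff_out, diff_passes = diffusion_decode(words, steps=4)
ar_out, ar_passes = autoregressive_decode(words)
print(f"diffusion-style passes: {diff_passes}, autoregressive passes: {ar_passes}")
```

In this toy, ten tokens take four passes instead of ten. Whether that translates into real speedups depends on how expensive each pass is and how many refinement steps the model needs to reach acceptable quality.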
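The "A4B" designation in the model name most likely follows the common naming convention for mixture-of-experts models, in which it would indicate roughly 4B active parameters per token out of 26B total, though Google has not released full technical specifications.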
What this means
DiffusionGemma 26B strengthens the case for diffusion architectures as a viable alternative to autoregressive generation for language models. The reported 500+ tokens/second speed, combined with Apache 2.0 licensing, puts it among the fastest openly available language models by generation speed. This could shift inference economics for applications requiring high-throughput text generation, though quality comparisons with standard models like Llama or Gemma remain to be established through independent benchmarking.