NVIDIA Releases Quantized DiffusionGemma 26B: 1,100+ Tokens/Second with 256K Context Window

TL;DR

NVIDIA released a quantized version of Google DeepMind's DiffusionGemma 26B A4B IT, a multimodal model with 25.2B total parameters (3.8B active) that processes text, image, and video inputs. The NVFP4-quantized model achieves generation speeds exceeding 1,100 tokens per second on NVIDIA H100 GPUs while supporting a 256K token context window.

June 17, 2026 · 12:06 PM2 min read

DiffusionGemma 26B A4B IT NVFP4 — Quick Specs

Context window262K tokens

Compare DiffusionGemma 26B A4B IT NVFP4 with other models →

NVIDIA Releases Quantized DiffusionGemma 26B: 1,100+ Tokens/Second with 256K Context Window

NVIDIA released a quantized version of Google DeepMind's DiffusionGemma 26B A4B IT, a multimodal model with 25.2B total parameters (3.8B active) that processes text, image, and video inputs. The NVFP4-quantized model achieves generation speeds exceeding 1,100 tokens per second at low batch sizes on NVIDIA H100 GPUs while supporting a 256K token context window.

Technical Specifications

The model uses a Mixture-of-Experts (MoE) architecture built on Gemma 4 with 25.2B total parameters and 3.8B active parameters. NVIDIA quantized the weights and activations from 16 bits to 4 bits using Model Optimizer, reducing GPU memory requirements while maintaining benchmark performance within 1% of the full-precision baseline.

DiffusionGemma generates tokens in parallel 256-token blocks via discrete diffusion sampling with bidirectional attention, enabling the high generation speed. The model supports variable aspect ratios and resolutions for images through a configurable visual token budget (70, 140, 280, 560, or 1120 tokens per image) and processes videos up to 60 seconds at 1 frame per second.

Benchmark Performance

According to NVIDIA's evaluation with thinking mode enabled, the NVFP4-quantized model maintains near-parity with the BF16 baseline:

GPQA Diamond: 68.6% (baseline 69.4%)
AIME 2025: 67.33% (baseline 68.33%)
GSM8K: 94.01% (baseline 94.54%)
HumanEval: 95.00% (baseline 94.09%)
MMLU 0-Shot: 88.13% (baseline 88.50%)
MMLU Pro: 80.7% (baseline 81.0%)
IFEval: 94.56% (baseline 94.01%)

Key Features

The model includes native function calling, structured JSON output formatting, configurable thinking (reasoning) mode, and multilingual inference across 35+ languages. It supports text, image, and video inputs, with training data cutoff in January 2025.

Pricing information has not been disclosed. The model is available for commercial and non-commercial use under Apache 2.0 and Gemma Terms of Use, optimized for deployment on NVIDIA Hopper and Blackwell architectures via vLLM.

What This Means

NVIDIA's 4-bit quantization demonstrates that aggressive compression can maintain performance on academic benchmarks while delivering substantial efficiency gains. The 1,100+ tokens/second generation speed and 256K context window make this a competitive option for high-throughput multimodal applications, though real-world deployment will require validation on specific use cases. The MoE architecture's 3.8B active parameters out of 25.2B total suggests efficient inference scaling, but companies should verify the model's performance degradation on their proprietary evaluation sets before production deployment.

Source: huggingface.co ↗

NVIDIA Google DeepMind DiffusionGemma Gemma 4 quantization multimodal MoE discrete diffusion

model releaseJuly 29, 2026

Unsloth Releases GGUF Quantizations of Kimi K3, a 2.8T-Parameter Open-Weight MoE Model

Unsloth has released GGUF quantizations of Kimi K3, a 2.8-trillion-parameter open-weight Mixture-of-Experts model from Moonshot AI with a 1-million-token context window and native vision support. The largest lossless quantization (Q8) weighs in at 1.56TB.

model releaseJuly 31, 2026

Google DeepMind Launches Gemini Robotics 2, a Single VLA Model for Arms to Humanoids

Google DeepMind has introduced Gemini Robotics 2, a vision-language-action model it calls its most advanced yet, designed to control everything from tabletop robot arms to full-body humanoids. The company also released Gemini Robotics ER 2, an embodied reasoning model that replaces ER 1.6.

model releaseJuly 31, 2026

DeepSeek Releases V4-Flash-0731, a 284B-Parameter Model That Beats Its Own Larger Pro Variant on Agentic Benchmarks

DeepSeek has shipped the full release of DeepSeek-V4-Flash-0731, a 284B-parameter model that according to DeepSeek outperforms its own larger V4-Pro (Preview) on agentic and coding benchmarks. Unsloth has published quantized GGUF versions, with lossless 8-bit weights requiring 162GB of storage.

model releaseJuly 31, 2026

Thinking Machines Lab Releases Inkling Small: 276B MoE Model with 524K Context Window

Thinking Machines Lab has released Inkling Small, an open-weight multimodal mixture-of-experts model with 12B active parameters out of 276B total and a 524K token context window. The model targets reasoning, coding, agentic workflows, and multilingual use cases at $0.58 per 1M input tokens and $1.44 per 1M output tokens.

NVIDIA Releases Quantized DiffusionGemma 26B: 1,100+ Tokens/Second with 256K Context Window

DiffusionGemma 26B A4B IT NVFP4 — Quick Specs

NVIDIA Releases Quantized DiffusionGemma 26B: 1,100+ Tokens/Second with 256K Context Window

Technical Specifications

Benchmark Performance

Key Features

What This Means

Related Articles

Unsloth Releases GGUF Quantizations of Kimi K3, a 2.8T-Parameter Open-Weight MoE Model

Google DeepMind Launches Gemini Robotics 2, a Single VLA Model for Arms to Humanoids

DeepSeek Releases V4-Flash-0731, a 284B-Parameter Model That Beats Its Own Larger Pro Variant on Agentic Benchmarks

Thinking Machines Lab Releases Inkling Small: 276B MoE Model with 524K Context Window

Comments