
NVIDIA Releases Quantized DiffusionGemma 26B: 1,100+ Tokens/Second with 256K Context Window

TL;DR

NVIDIA released a quantized version of Google DeepMind's DiffusionGemma 26B A4B IT, a multimodal model with 25.2B total parameters (3.8B active) that processes text, image, and video inputs. The NVFP4-quantized model achieves generation speeds exceeding 1,100 tokens per second at low batch sizes on NVIDIA H100 GPUs while supporting a 256K token context window.


NVIDIA released a quantized version of Google DeepMind's DiffusionGemma 26B A4B IT, a multimodal model with 25.2B total parameters (3.8B active) that processes text, image, and video inputs. The NVFP4-quantized model achieves generation speeds exceeding 1,100 tokens per second at low batch sizes on NVIDIA H100 GPUs while supporting a 256K token context window.

Technical Specifications

The model uses a Mixture-of-Experts (MoE) architecture built on Gemma 4, with 25.2B total parameters and 3.8B active parameters. NVIDIA quantized the weights and activations from 16 bits to 4 bits using Model Optimizer, reducing GPU memory requirements while keeping reported benchmark scores within about 1 percentage point of the full-precision baseline.
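As a rough back-of-the-envelope check on the memory savings, the sketch below estimates weight storage at 16 and 4 bits per parameter. It assumes every one of the 25.2B parameters is stored at the stated bit width and ignores scale-factor overhead, activations, and KV cache, none of which the article specifies, so real footprints will differ.

```python
# Rough weight-memory estimate. Assumption: all parameters are stored at the
# given bit width. Real checkpoints may keep some layers in higher precision
# and add per-block scale factors, which this sketch ignores.

TOTAL_PARAMS = 25.2e9   # total parameters, from the article
ACTIVE_PARAMS = 3.8e9   # active parameters, from the article


def weight_gigabytes(num_params: float, bits_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_gigabytes(TOTAL_PARAMS, 16)
fp4_gb = weight_gigabytes(TOTAL_PARAMS, 4)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"16-bit weights: {bf16_gb:.1f} GB")
print(f"4-bit weights:  {fp4_gb:.1f} GB")
print(f"Active share of parameters: {active_fraction:.1%}")
```

Under these assumptions the weights shrink from about 50.4 GB to about 12.6 GB, and roughly 15% of the parameters are active at a time.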

DiffusionGemma generates tokens in parallel 256-token blocks via discrete diffusion sampling with bidirectional attention, enabling the high generation speed. The model supports variable aspect ratios and resolutions for images through a configurable visual token budget (70, 140, 280, 560, or 1120 tokens per image) and processes videos up to 60 seconds at 1 frame per second.
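To get a feel for how the visual token budget interacts with the context window, here is a minimal planning sketch. Two things in it are assumptions rather than facts from the article: that each sampled video frame costs the same number of tokens as one image at the chosen budget, and that "256K" means 256 × 1024 tokens. The function name `video_tokens` is hypothetical.

```python
# Context-budget sketch. Assumptions (not stated in the article): each video
# frame costs one image's worth of visual tokens, and "256K" means 256 * 1024.

CONTEXT_WINDOW = 256 * 1024                        # assumed size of "256K"
VISUAL_TOKEN_BUDGETS = (70, 140, 280, 560, 1120)   # per-image options, from the article
MAX_VIDEO_SECONDS = 60                             # from the article
FRAMES_PER_SECOND = 1                              # from the article


def video_tokens(seconds: int, tokens_per_frame: int) -> int:
    """Visual tokens for a clip sampled at 1 frame per second."""
    if tokens_per_frame not in VISUAL_TOKEN_BUDGETS:
        raise ValueError(f"unsupported budget: {tokens_per_frame}")
    if not 0 <= seconds <= MAX_VIDEO_SECONDS:
        raise ValueError(f"video length must be 0-{MAX_VIDEO_SECONDS} s")
    return seconds * FRAMES_PER_SECOND * tokens_per_frame


for budget in VISUAL_TOKEN_BUDGETS:
    used = video_tokens(MAX_VIDEO_SECONDS, budget)
    print(f"{budget:>4} tokens/frame -> {used:>6} tokens "
          f"({used / CONTEXT_WINDOW:.1%} of context)")
```

If those assumptions hold, a full 60-second clip at the largest budget would use 67,200 tokens, roughly a quarter of the window, while the smallest budget would use 4,200.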

Benchmark Performance

According to NVIDIA's evaluation with thinking mode enabled, the NVFP4-quantized model maintains near-parity with the BF16 baseline:

  • GPQA Diamond: 68.6% (baseline 69.4%)
  • AIME 2025: 67.33% (baseline 68.33%)
  • GSM8K: 94.01% (baseline 94.54%)
  • HumanEval: 95.00% (baseline 94.09%)
  • MMLU 0-Shot: 88.13% (baseline 88.50%)
  • MMLU Pro: 80.7% (baseline 81.0%)
  • IFEval: 94.56% (baseline 94.01%)
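
To make the size of each gap explicit, the short sketch below computes the difference between the quantized and baseline scores in percentage points. The numbers are copied from the list above; the variable names are illustrative only.

```python
# Scores copied from the list above: (NVFP4 score, BF16 baseline), in percent.
SCORES = {
    "GPQA Diamond": (68.6, 69.4),
    "AIME 2025": (67.33, 68.33),
    "GSM8K": (94.01, 94.54),
    "HumanEval": (95.00, 94.09),
    "MMLU 0-Shot": (88.13, 88.50),
    "MMLU Pro": (80.7, 81.0),
    "IFEval": (94.56, 94.01),
}

# Delta in percentage points; negative means the quantized model scored lower.
deltas = {name: round(quant - base, 2) for name, (quant, base) in SCORES.items()}

for name, delta in deltas.items():
    print(f"{name:<13} {delta:+.2f} pp")

worst = min(deltas, key=deltas.get)
print(f"Largest drop: {worst} ({deltas[worst]:+.2f} pp)")
```

The largest drop is on AIME 2025 (1.00 percentage point), and the quantized model scores slightly higher than the baseline on HumanEval and IFEval.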

Key Features

The model includes native function calling, structured JSON output formatting, a configurable thinking (reasoning) mode, and multilingual inference across 35+ languages. It accepts text, image, and video inputs, and its training data cutoff is January 2025.

Pricing information has not been disclosed. The model is available for commercial and non-commercial use under Apache 2.0 and the Gemma Terms of Use, and is optimized for deployment on NVIDIA Hopper and Blackwell architectures via vLLM.

What This Means

NVIDIA's 4-bit quantization demonstrates that aggressive compression can preserve performance on academic benchmarks while delivering substantial efficiency gains. The 1,100+ tokens/second generation speed and 256K context window make this a competitive option for high-throughput multimodal applications, though real-world deployment will require validation on specific use cases. With only 3.8B of its 25.2B parameters active, the MoE architecture suggests efficient inference scaling, but companies should measure any accuracy degradation on their own proprietary evaluation sets before production deployment.
