NVIDIA Releases Nemotron-Labs Diffusion Models With 6.4× More Tokens per Forward Pass Than Autoregressive Decoding

TL;DR

NVIDIA has released Nemotron-Labs Diffusion, a family of diffusion language models at 3B, 8B, and 14B scales that generate multiple tokens in parallel rather than one at a time. In self-speculation mode, the 8B model produces 6.4× as many tokens per forward pass as autoregressive decoding while maintaining comparable accuracy.

May 23, 2026 · 12:21 AM

NVIDIA has released Nemotron-Labs Diffusion, a family of diffusion language models that generate tokens in parallel rather than sequentially, producing up to 6.4× as many tokens per forward pass as traditional autoregressive models.

Model Specifications

The Nemotron-Labs Diffusion family includes:

Text models: 3B, 8B, and 14B parameter versions
Vision-language model: 8B parameter multimodal variant
License: NVIDIA Nemotron Open Model License (text models), NVIDIA Source Code License (VLM)
Training data: 1.3T tokens for pretraining, 45B tokens for supervised fine-tuning
Release includes: Base models, instruction-tuned chat variants, and training code via NVIDIA Megatron Bridge framework

Performance Benchmarks

According to NVIDIA, the 8B model demonstrates:

1.2% higher average accuracy than Qwen3 8B
2.6× tokens per forward pass (TPF) in diffusion mode versus autoregressive decoding
6× TPF with linear self-speculation
6.4× TPF with quadratic self-speculation
~865 tokens/sec on a B200 GPU in self-speculation mode (approximately 4× the autoregressive baseline)

Tokens per forward pass (TPF) counts how many tokens the model commits per forward pass, so it measures decoding efficiency independently of any specific hardware configuration.
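The arithmetic behind TPF can be sketched in a few lines of Python. The token and pass counts below are illustrative, not NVIDIA's measurements, and the function names are hypothetical. The last helper rests on a simplifying assumption that wall-clock throughput scales as TPF divided by the cost of one pass; under that assumption, the reported 6.4× TPF and roughly 4× throughput would imply each self-speculation pass costs about 1.6× an autoregressive pass.

```python
def tokens_per_forward_pass(tokens_committed: int, forward_passes: int) -> float:
    """Average number of tokens committed per model forward pass."""
    if forward_passes <= 0:
        raise ValueError("forward_passes must be positive")
    return tokens_committed / forward_passes


def implied_pass_cost_ratio(tpf_gain: float, throughput_gain: float) -> float:
    """Per-pass slowdown that would explain the gap between a TPF gain and a
    wall-clock throughput gain. Assumes throughput ~ TPF / cost per pass."""
    return tpf_gain / throughput_gain


# Autoregressive decoding commits exactly one token per pass.
ar_tpf = tokens_per_forward_pass(256, 256)

# Illustrative counts only: 256 tokens committed over 40 passes.
spec_tpf = tokens_per_forward_pass(256, 40)

# Figures reported in the article: 6.4x TPF, roughly 4x tokens/sec.
cost_ratio = implied_pass_cost_ratio(6.4, 4.0)

print(ar_tpf, spec_tpf, cost_ratio)
```

This is why TPF and tokens/sec are reported separately: the first describes the decoding algorithm, the second also depends on how expensive each pass is on a given GPU.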

Three Generation Modes

Nemotron-Labs Diffusion supports three inference modes in a single model:

Autoregressive mode: Standard left-to-right generation for compatibility with existing workflows
Diffusion mode: Generates 32-token blocks in parallel, iteratively refining tokens across multiple denoising steps
Self-speculation mode: Uses diffusion to draft candidate tokens, then verifies them autoregressively

Developers can switch between modes at deployment time without application-level changes.
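To make diffusion mode concrete, here is a minimal toy sketch of decoding one block by iterative refinement. It is not NVIDIA's implementation: the denoiser is a hypothetical stand-in with hard-coded confidences, the block is 8 tokens rather than 32, and the commit rule (keep any position whose confidence clears a threshold, and always commit at least the most confident one) is an assumption chosen for illustration.

```python
from typing import List, Optional, Tuple

MASK = None  # marks a position that has not been committed yet

# Hypothetical starting confidences for one 8-token block.
BASE_CONFIDENCE = [0.95, 0.91, 0.60, 0.85, 0.55, 0.75, 0.92, 0.40]
TARGET = list("parallel")  # tokens the toy denoiser converges to


def toy_denoise(block: List[Optional[str]], step: int) -> List[Tuple[str, float]]:
    """Stand-in for one forward pass: a (token, confidence) guess for every
    position. Confidence rises by 0.2 with each refinement step."""
    return [
        (TARGET[i], min(1.0, BASE_CONFIDENCE[i] + 0.2 * step))
        for i in range(len(block))
    ]


def decode_block(block_size: int, threshold: float = 0.9) -> Tuple[List[str], int]:
    """Fill a fully masked block, committing several positions per pass."""
    block: List[Optional[str]] = [MASK] * block_size
    passes = 0
    # At least one position is committed per pass, so block_size passes suffice.
    for step in range(block_size):
        if MASK not in block:
            break
        guesses = toy_denoise(block, step)
        passes += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        ready = [i for i in masked if guesses[i][1] >= threshold]
        if not ready:
            # Guarantee progress: commit the single most confident position.
            ready = [max(masked, key=lambda i: guesses[i][1])]
        for i in ready:
            block[i] = guesses[i][0]
    return block, passes


strict_block, strict_passes = decode_block(8, threshold=0.9)
loose_block, loose_passes = decode_block(8, threshold=0.5)
print("".join(strict_block), strict_passes, loose_passes)
```

In this toy, lowering the threshold finishes the block in fewer passes. In a real model that trade-off would come at some risk of committing lower-quality tokens.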

Technical Architecture

The models build on recent research showing pretrained autoregressive models can be converted to diffusion language models through continued pretraining. Key design elements:

Block-wise attention mechanism enables KV-cache compatibility
Joint AR and diffusion training objective preserves original autoregressive capabilities
Confidence thresholding determines when generated tokens are committed
Built-in inference budget control through adjustable refinement steps
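One common form of block-wise attention, assumed here since the article does not spell out the exact mask, lets each token attend in both directions within its own block and to every token in earlier blocks, but never to later blocks. The sketch below builds such a mask and checks the property that makes KV caching possible: attention for finished blocks does not change as new blocks are appended. The function name and sizes are illustrative.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k:
    bidirectional inside a block, causal across blocks."""
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]


short = block_causal_mask(4, block_size=2)  # two finished blocks
longer = block_causal_mask(6, block_size=2)  # same sequence plus one new block

# Rows for the first two blocks are unchanged once a third block is added,
# so keys and values computed for finished blocks can be cached and reused.
prefix_unchanged = all(longer[q][:4] == short[q] for q in range(4))

for row in longer:
    print("".join("x" if allowed else "." for allowed in row))
print(prefix_unchanged)
```

A fully bidirectional mask would fail the same check, because earlier positions would attend to newly appended tokens and their cached values would go stale.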

Unlike autoregressive models that finalize each token immediately, diffusion models can revise previously generated tokens, making them suitable for text editing and fill-in-the-middle tasks.

Deployment

Inference support is available through SGLang, with the integration currently accessible through the project's GitHub issue tracker. The same checkpoint can serve all three generation modes, selected through a single configuration parameter (ar_mode).

NVIDIA reports that at temperature 0, self-speculation mode produces lossless output, meaning output identical to autoregressive decoding, which keeps results deterministic.
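The temperature-0 equivalence follows from how draft-and-verify decoding works in general, and a toy sketch shows why. Everything below is hypothetical: greedy_next stands in for greedy autoregressive decoding, noisy_draft stands in for the diffusion drafter (and is deliberately wrong at regular intervals), and the real system would verify a whole draft in a batched forward pass rather than token by token. The article's "linear" and "quadratic" variants are not modeled.

```python
from typing import List, Tuple

VOCAB = 50


def greedy_next(prefix: List[int]) -> int:
    """Toy deterministic 'model': stands in for greedy (temperature 0) decoding."""
    return (31 * sum(prefix) + 7 * len(prefix) + 3) % VOCAB


def autoregressive_decode(prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(greedy_next(out))
    return out[len(prompt):]


def noisy_draft(prefix: List[int], block_size: int) -> List[int]:
    """Hypothetical drafter: guesses a block, wrong at every 4th position."""
    ctx, draft = list(prefix), []
    for _ in range(block_size):
        tok = greedy_next(ctx)
        if len(ctx) % 4 == 0:
            tok = (tok + 1) % VOCAB  # inject a wrong guess
        draft.append(tok)
        ctx.append(tok)
    return draft


def self_speculative_decode(
    prompt: List[int], n_new: int, block_size: int = 8
) -> Tuple[List[int], int]:
    """Accept drafted tokens while they match the verifier; at the first
    mismatch keep the verifier's token and discard the rest of the draft."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        draft = noisy_draft(out, block_size)
        rounds += 1
        for tok in draft:
            expected = greedy_next(out)
            out.append(expected)  # only verifier-approved tokens are kept
            if tok != expected:
                break
    return out[len(prompt):len(prompt) + n_new], rounds


prompt = [5, 17, 42]
reference = autoregressive_decode(prompt, 16)
tokens, rounds = self_speculative_decode(prompt, 16)
print(tokens == reference, rounds)
```

Because every kept token is the one the verifier would have chosen, draft quality affects only how many rounds are needed, never the output.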

What This Means

Nemotron-Labs Diffusion addresses a fundamental bottleneck in language model inference: memory bandwidth. Traditional autoregressive models spend most GPU time on memory operations rather than computation, particularly at small batch sizes. By generating and refining multiple tokens in parallel, diffusion models better utilize modern GPU architectures.

The ability to switch between AR and diffusion modes in the same model is the practical innovation here. Developers can deploy autoregressive mode for maximum compatibility, diffusion for throughput-critical workloads, or self-speculation when both speed and deterministic output matter. The 6.4× TPF and roughly 4× throughput figures are NVIDIA's own and have yet to be independently verified, but if confirmed, they represent a meaningful shift in how inference-optimized models might be designed.

Source: huggingface.co
