NVIDIA Releases Nemotron-Labs Diffusion Models With Up to 6.4× More Tokens per Forward Pass Than Autoregressive Decoding

TL;DR

NVIDIA has released Nemotron-Labs Diffusion, a family of diffusion language models at 3B, 8B, and 14B scales that generate multiple tokens in parallel rather than one at a time. In self-speculation mode, the 8B model produces up to 6.4× as many tokens per forward pass as autoregressive decoding while maintaining comparable accuracy.

NVIDIA has released Nemotron-Labs Diffusion, a family of diffusion language models that generate tokens in parallel rather than sequentially, producing up to 6.4× as many tokens per forward pass as traditional autoregressive models.

Model Specifications

The Nemotron-Labs Diffusion family includes:

  • Text models: 3B, 8B, and 14B parameter versions
  • Vision-language model: 8B parameter multimodal variant
  • License: NVIDIA Nemotron Open Model License (text models), NVIDIA Source Code License (VLM)
  • Training data: 1.3T tokens for pretraining, 45B tokens for supervised fine-tuning
  • Release includes: Base models, instruction-tuned chat variants, and training code via NVIDIA Megatron Bridge framework

Performance Benchmarks

According to NVIDIA, the 8B model demonstrates:

  • 1.2% higher average accuracy than Qwen3 8B
  • 2.6× the tokens per forward pass of autoregressive decoding in diffusion mode
  • 6× TPF with linear self-speculation
  • 6.4× TPF with quadratic self-speculation
  • ~865 tokens/sec on a B200 GPU in self-speculation mode (approximately 4× the autoregressive baseline)

Tokens per forward pass (TPF) counts how many tokens the model generates per forward pass, so it measures decoding efficiency independently of the hardware configuration.
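
As a rough illustration of how TPF relates to the throughput figures above, the arithmetic can be sketched in a few lines of Python. The run sizes (640 tokens, 100 passes) are hypothetical values chosen to reproduce the 6.4× figure. The last step assumes that the 6.4× TPF and the ~4× throughput numbers describe the same configuration, which the article does not state.

```python
def tokens_per_forward_pass(tokens_generated: int, forward_passes: int) -> float:
    """Average number of tokens generated per model forward pass."""
    if forward_passes <= 0:
        raise ValueError("forward_passes must be positive")
    return tokens_generated / forward_passes


# Plain autoregressive decoding runs one forward pass per token, so TPF is 1.
ar_tpf = tokens_per_forward_pass(tokens_generated=640, forward_passes=640)

# Hypothetical run that matches the reported 6.4x figure: 640 tokens in 100 passes.
spec_tpf = tokens_per_forward_pass(tokens_generated=640, forward_passes=100)

# Figures reported in the article for the 8B model on a B200 GPU.
reported_tokens_per_sec = 865.0
reported_speedup = 4.0

# Autoregressive baseline implied by those two figures (about 216 tokens/sec).
implied_baseline_tokens_per_sec = reported_tokens_per_sec / reported_speedup

# If both numbers describe the same setup (an assumption), each multi-token pass
# costs this many times more wall-clock time than a single-token pass.
relative_pass_cost = spec_tpf / reported_speedup

print(ar_tpf, spec_tpf, implied_baseline_tokens_per_sec, relative_pass_cost)
```

Under that assumption, the gap between 6.4× TPF and ~4× wall-clock throughput would mean each multi-token pass is more expensive than a single-token pass, which is why TPF and tokens/sec are reported separately.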

Three Generation Modes

Nemotron-Labs Diffusion supports three inference modes in a single model:

  1. Autoregressive mode: Standard left-to-right generation for compatibility with existing workflows
  2. Diffusion mode: Generates 32-token blocks in parallel, iteratively refining tokens across multiple denoising steps
  3. Self-speculation mode: Uses diffusion to draft candidate tokens, then verifies them autoregressively

Developers can switch between modes at deployment time without application-level changes.

Technical Architecture

The models build on recent research showing pretrained autoregressive models can be converted to diffusion language models through continued pretraining. Key design elements:

  • Block-wise attention mechanism enables KV-cache compatibility
  • Joint AR and diffusion training objective preserves original autoregressive capabilities
  • Confidence thresholding determines when generated tokens are committed
  • Built-in inference budget control through adjustable refinement steps
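
One common way to realize block-wise attention is to let tokens attend in both directions inside their own block and causally to all earlier blocks. The article does not spell out NVIDIA's exact pattern, so the sketch below is an assumption that only illustrates why such a layout stays KV-cache friendly. The function name and the small sizes are hypothetical.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Assumed pattern: full attention inside a block, causal attention across blocks.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        [k // block_size <= q // block_size for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = block_attention_mask(seq_len=6, block_size=3)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Because no position in a finished block ever attends to a later block, the keys and values of finished blocks never change and can be cached, just as in autoregressive decoding.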

Unlike autoregressive models that finalize each token immediately, diffusion models can revise previously generated tokens, making them suitable for text editing and fill-in-the-middle tasks.
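
The confidence thresholding and refinement-step budget listed above can be illustrated with a toy decoding loop. This is a minimal sketch, not NVIDIA's implementation: `decode_block`, `toy_predict`, the confidence values, and the fallback rule that commits the single most confident token when none crosses the threshold are all hypothetical. The sketch models only progressive unmasking, not the revision of already committed tokens.

```python
from typing import Callable, Optional

Prediction = tuple[str, float]  # (proposed token, confidence in [0, 1])


def decode_block(
    predict: Callable[[list[Optional[str]]], list[Prediction]],
    block_size: int,
    threshold: float,
    max_steps: int,
) -> tuple[list[Optional[str]], int]:
    """Fill one block, committing tokens whose confidence reaches the threshold."""
    block: list[Optional[str]] = [None] * block_size  # None marks a masked slot
    steps = 0
    while steps < max_steps and any(tok is None for tok in block):
        preds = predict(block)  # one forward pass scores every position
        steps += 1
        open_positions = [i for i, tok in enumerate(block) if tok is None]
        confident = [i for i in open_positions if preds[i][1] >= threshold]
        if not confident:
            # Hypothetical fallback so every step makes progress.
            confident = [max(open_positions, key=lambda i: preds[i][1])]
        for i in confident:
            block[i] = preds[i][0]
    return block, steps


# Toy stand-in for a model: confidence rises as more context is committed.
TARGET = ["the", "cat", "sat", "down"]
BASE_CONFIDENCE = [0.9, 0.6, 0.85, 0.4]


def toy_predict(block: list[Optional[str]]) -> list[Prediction]:
    committed = sum(tok is not None for tok in block)
    return [
        (TARGET[i], min(1.0, BASE_CONFIDENCE[i] + 0.25 * committed))
        for i in range(len(block))
    ]


block, steps = decode_block(toy_predict, block_size=4, threshold=0.8, max_steps=8)
partial, partial_steps = decode_block(toy_predict, block_size=4, threshold=0.8, max_steps=1)
print(block, steps)
print(partial, partial_steps)
```

Lowering `max_steps` trades completeness for compute in this toy, which is the kind of knob the inference budget control describes; a real decoder would need a policy for slots still masked when the budget runs out.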

Deployment

Inference support is available through SGLang; the integration is currently accessible via the GitHub issue tracker. A single configuration parameter (ar_mode) lets the same checkpoint serve all three generation modes.

NVIDIA reports that at temperature 0, self-speculation mode is lossless: its output is identical to that of autoregressive decoding, so results stay deterministic.
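
The lossless property follows from the draft-then-verify structure: a drafted token survives only if the autoregressive pass would have chosen the same token. The sketch below shows this with toy stand-ins. `toy_ar_next` and `toy_draft` are hypothetical functions, not the model's API, and the verification is written as a Python loop where a real system would check the whole draft in a single forward pass.

```python
from typing import Callable, Sequence

NextFn = Callable[[Sequence[int]], int]
DraftFn = Callable[[Sequence[int], int], list[int]]


def greedy_decode(ar_next: NextFn, prompt: Sequence[int], n_new: int) -> list[int]:
    """Reference decoder: one greedy autoregressive step per token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(ar_next(out))
    return out


def self_speculative_decode(
    ar_next: NextFn, draft: DraftFn, prompt: Sequence[int], n_new: int, block_size: int
) -> tuple[list[int], int]:
    """Draft a block, keep the prefix the AR check agrees with, repeat."""
    out = list(prompt)
    target_len = len(prompt) + n_new
    rounds = 0
    while len(out) < target_len:
        rounds += 1
        k = min(block_size, target_len - len(out))
        proposal = draft(out, k)
        if not proposal:
            out.append(ar_next(out))  # fall back to a plain AR step
            continue
        for tok in proposal[:k]:
            expected = ar_next(out)
            out.append(expected)  # always the AR choice, so output stays lossless
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
    return out, rounds


def toy_ar_next(prefix: Sequence[int]) -> int:
    """Deterministic toy 'model' standing in for greedy AR decoding."""
    return (prefix[-1] * 3 + len(prefix)) % 11


def toy_draft(prefix: Sequence[int], k: int) -> list[int]:
    """Toy drafter that is usually right but errs at every fifth position."""
    ctx = list(prefix)
    proposal = []
    for _ in range(k):
        tok = toy_ar_next(ctx)
        if len(ctx) % 5 == 0:
            tok = (tok + 1) % 11  # deliberate drafting error
        proposal.append(tok)
        ctx.append(tok)
    return proposal


baseline = greedy_decode(toy_ar_next, [1], 20)
speculative, rounds = self_speculative_decode(toy_ar_next, toy_draft, [1], 20, block_size=4)
print(speculative == baseline, rounds)
```

Each round here corresponds to one draft pass plus one verification pass, so fewer rounds than tokens is what shows up as a TPF above 1.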

What This Means

Nemotron-Labs Diffusion addresses a fundamental bottleneck in language model inference: memory bandwidth. Traditional autoregressive models spend most GPU time on memory operations rather than computation, particularly at small batch sizes. By generating and refining multiple tokens in parallel, diffusion models better utilize modern GPU architectures.
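
A back-of-the-envelope bound makes the bandwidth argument concrete. The sketch assumes that at batch size 1 every forward pass must stream all model weights from memory once, and it ignores KV-cache traffic and the extra compute of multi-token passes. The 16-bit weight size and the 8 TB/s bandwidth are illustrative assumptions, not figures from the article.

```python
def max_passes_per_second(
    param_count: float, bytes_per_param: float, bandwidth_bytes_per_s: float
) -> float:
    """Upper bound on forward passes/sec when each pass reads all weights once."""
    weight_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Assumed values for illustration only: 8B parameters, 2 bytes each, 8 TB/s.
passes_per_s = max_passes_per_second(
    param_count=8e9, bytes_per_param=2, bandwidth_bytes_per_s=8e12
)

# The token ceiling scales with tokens per forward pass (TPF).
ar_ceiling = passes_per_s * 1.0     # autoregressive: one token per pass
tpf_ceiling = passes_per_s * 6.4    # reported TPF in self-speculation mode

print(passes_per_s, ar_ceiling, tpf_ceiling)
```

The point of the bound is only that, when weight reads dominate, tokens per second can rise with TPF without any additional memory bandwidth; measured throughput will sit well below such ceilings.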

The practical innovation is the ability to switch between AR and diffusion modes in the same model. Developers can deploy autoregressive mode for maximum compatibility, diffusion mode for throughput-critical workloads, or self-speculation mode when both speed and deterministic output matter. The reported gains (6.4× tokens per forward pass and roughly 4× measured throughput) have yet to be independently verified, but if confirmed, they would represent a meaningful shift in how inference-optimized models might be designed.
