NVIDIA Releases Nemotron-Labs Diffusion Models With 6.4× Faster Token Generation Than Autoregressive Decoding
NVIDIA has released Nemotron-Labs Diffusion, a family of diffusion language models at 3B, 8B, and 14B scales that generate multiple tokens in parallel rather than one at a time. According to NVIDIA, the 8B model in self-speculation mode produces 6.4× as many tokens per forward pass as autoregressive decoding while maintaining comparable accuracy.
Model Specifications
The Nemotron-Labs Diffusion family includes:
- Text models: 3B, 8B, and 14B parameter versions
- Vision-language model: 8B parameter multimodal variant
- License: NVIDIA Nemotron Open Model License (text models), NVIDIA Source Code License (VLM)
- Training data: 1.3T tokens for pretraining, 45B tokens for supervised fine-tuning
- Release includes: Base models, instruction-tuned chat variants, and training code via the NVIDIA Megatron Bridge framework
Performance Benchmarks
According to NVIDIA, the 8B model demonstrates:
- 1.2% higher average accuracy than Qwen3 8B
- 2.6× tokens per forward pass (TPF) in diffusion mode versus autoregressive decoding
- 6× TPF with linear self-speculation
- 6.4× TPF with quadratic self-speculation
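- ~865 tokens/sec on a B200 GPU in self-speculation mode (approximately 4× the autoregressive baseline)
Tokens per forward pass (TPF) counts how many tokens the model commits per forward pass, so it measures decoding efficiency independently of specific hardware configurations. It is not the same as wall-clock speedup: the reported throughput gain (about 4×) is smaller than the reported TPF gains (6× to 6.4×).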
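The gap between a TPF gain and a throughput gain can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the token and pass counts are made up, and pairing the 6× TPF figure with the roughly 4× throughput figure is an assumption, since the article does not say which self-speculation variant the throughput number was measured with.

```python
def tokens_per_forward_pass(tokens_generated: int, forward_passes: int) -> float:
    """TPF: tokens committed divided by forward passes spent producing them."""
    if forward_passes <= 0:
        raise ValueError("forward_passes must be positive")
    return tokens_generated / forward_passes


def implied_cost_per_pass(tpf_gain: float, throughput_gain: float) -> float:
    """Relative cost of one forward pass versus the autoregressive baseline.

    If a mode commits `tpf_gain` times more tokens per pass but is only
    `throughput_gain` times faster end to end, each pass must take about
    tpf_gain / throughput_gain times as long as a baseline pass.
    """
    if throughput_gain <= 0:
        raise ValueError("throughput_gain must be positive")
    return tpf_gain / throughput_gain


# Autoregressive decoding commits exactly one token per pass.
ar_tpf = tokens_per_forward_pass(256, 256)

# Made-up counts: 256 tokens committed in 40 passes.
example_tpf = tokens_per_forward_pass(256, 40)

# Assumed pairing of reported figures: 6x TPF, about 4x throughput.
pass_cost = implied_cost_per_pass(6.0, 4.0)

print(ar_tpf, example_tpf, pass_cost)
```

Under these assumptions each multi-token pass would cost about 1.5 times a baseline pass, which is one way a hardware-independent TPF number and a measured tokens/sec number can both hold.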
Three Generation Modes
Nemotron-Labs Diffusion supports three inference modes in a single model:
- Autoregressive mode: Standard left-to-right generation for compatibility with existing workflows
- Diffusion mode: Generates 32-token blocks in parallel, iteratively refining tokens across multiple denoising steps
- Self-speculation mode: Uses diffusion to draft candidate tokens, then verifies them autoregressively
Developers can switch between modes at deployment time without application-level changes.
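The draft-then-verify idea behind self-speculation mode can be sketched with toy stand-ins. Nothing below is NVIDIA's implementation: `ar_next` is a hypothetical deterministic rule playing the role of greedy autoregressive decoding, `draft_block` is a hypothetical drafter that is deliberately wrong at some positions, and a "round" is one draft followed by one verification (how rounds map onto real forward passes is model-specific).

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption)


def ar_next(prefix: List[int]) -> int:
    """Toy stand-in for the greedy autoregressive next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def greedy_ar(prompt: List[int], n_new: int) -> List[int]:
    """Reference output: one token per step, left to right."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(ar_next(seq))
    return seq


def draft_block(prefix: List[int], k: int) -> List[int]:
    """Toy drafter: proposes k tokens, wrong whenever len(context) % 5 == 4."""
    ctx, out = list(prefix), []
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 5 == 4:
            tok = (tok + 1) % VOCAB  # inject a drafting error
        out.append(tok)
        ctx.append(tok)
    return out


def self_speculate(prompt: List[int], n_new: int, k: int = 8) -> Tuple[List[int], int]:
    """Draft k tokens, keep the longest verified prefix, fix the first miss."""
    seq, rounds = list(prompt), 0
    target = len(prompt) + n_new
    while len(seq) < target:
        draft = draft_block(seq, min(k, target - len(seq)))
        rounds += 1
        ctx = list(seq)
        for tok in draft:
            expected = ar_next(ctx)  # a real verifier scores all positions at once
            ctx.append(expected)     # always commit the verified token
            if tok != expected:
                break                # discard everything drafted after a mismatch
        seq = ctx
    return seq, rounds


output, rounds = self_speculate([1, 2, 3], 40)
print(output == greedy_ar([1, 2, 3], 40), rounds)
```

Because every committed token is the one the autoregressive rule would have chosen, the output matches plain greedy decoding exactly while taking fewer rounds than tokens; drafting errors only cost speed, not correctness.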
Technical Architecture
The models build on recent research showing pretrained autoregressive models can be converted to diffusion language models through continued pretraining. Key design elements:
- Block-wise attention mechanism enables KV-cache compatibility
- Joint AR and diffusion training objective preserves original autoregressive capabilities
- Confidence thresholding determines when generated tokens are committed
- Built-in inference budget control through adjustable refinement steps
Unlike autoregressive models that finalize each token immediately, diffusion models can revise previously generated tokens, making them suitable for text editing and fill-in-the-middle tasks.
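Confidence thresholding and the refinement-step budget can be illustrated with a toy decoding loop. This is a minimal sketch under stated assumptions, not the released model's algorithm: `toy_denoiser` is a hypothetical stand-in for one denoising pass, the threshold and budget values are arbitrary, and the fallback rules (commit the single most confident token if none passes the threshold, commit everything on the last allowed step) are choices made here for illustration. The sketch only commits tokens and does not model revising them.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = None  # marker for a position that has not been committed yet


def toy_denoiser(block: List[Optional[int]], rng: random.Random) -> Dict[int, Tuple[int, float]]:
    """Hypothetical denoising pass: (token, confidence) for each masked position.

    Confidence grows as more of the block is filled in, mimicking the way
    extra context makes the remaining predictions easier.
    """
    filled = sum(t is not MASK for t in block) / len(block)
    preds = {}
    for i, tok in enumerate(block):
        if tok is MASK:
            conf = min(1.0, 0.5 * rng.random() + 0.5 * filled + 0.3)
            preds[i] = (rng.randrange(1000), conf)
    return preds


def decode_block(block_size: int = 32, threshold: float = 0.9,
                 max_steps: int = 8, seed: int = 0) -> Tuple[List[Optional[int]], int]:
    """Fill one block in parallel, committing only confident tokens each step."""
    rng = random.Random(seed)
    block: List[Optional[int]] = [MASK] * block_size
    steps = 0
    while any(t is MASK for t in block) and steps < max_steps:
        steps += 1
        preds = toy_denoiser(block, rng)
        last_step = steps == max_steps
        chosen = [i for i, (_, conf) in preds.items() if conf >= threshold or last_step]
        if not chosen:
            # Guarantee progress: commit the single most confident prediction.
            chosen = [max(preds, key=lambda i: preds[i][1])]
        for i in chosen:
            block[i] = preds[i][0]
    return block, steps


block, steps = decode_block()
print(len(block), steps)
```

Lowering `max_steps` trades refinement for speed, while lowering `threshold` commits more tokens per step; both act as inference-budget knobs in this toy setting.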
Deployment
Inference support is available through SGLang; the integration is currently accessible through the GitHub issue tracker. The same checkpoint can serve all three generation modes through a single configuration parameter (ar_mode).
NVIDIA reports that at temperature 0, self-speculation mode is lossless: its output matches autoregressive decoding, so results stay deterministic.
What This Means
Nemotron-Labs Diffusion addresses a fundamental bottleneck in language model inference: memory bandwidth. Traditional autoregressive models spend most GPU time on memory operations rather than computation, particularly at small batch sizes. By generating and refining multiple tokens in parallel, diffusion models better utilize modern GPU architectures.
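A back-of-the-envelope bound shows why memory bandwidth dominates at small batch sizes. The sketch below assumes every forward pass must read all model weights from GPU memory once, and it ignores KV-cache reads, activations, and compute time. The function names are hypothetical, and the inputs (8 billion parameters, 2 bytes per parameter, 8 TB/s of bandwidth) are round placeholders, not figures from the article.

```python
def max_passes_per_sec(n_params: float, bytes_per_param: float,
                       mem_bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on forward passes per second when weight reads dominate."""
    weight_bytes = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / weight_bytes


def max_tokens_per_sec(passes_per_sec: float, tokens_per_pass: float) -> float:
    """Tokens/sec ceiling: passes per second times tokens committed per pass."""
    return passes_per_sec * tokens_per_pass


# Placeholder inputs: 8e9 params, 16-bit weights, 8e12 bytes/s bandwidth.
passes = max_passes_per_sec(8e9, 2, 8e12)

ar_ceiling = max_tokens_per_sec(passes, 1.0)     # one token per pass
multi_ceiling = max_tokens_per_sec(passes, 6.0)  # illustrative multi-token pass

print(passes, ar_ceiling, multi_ceiling)
```

In this simplified model the only way to raise the single-stream ceiling without more bandwidth is to commit more tokens per weight read, which is the lever parallel generation pulls.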
The ability to switch between AR and diffusion modes in the same model is the practical innovation here. Developers can deploy autoregressive mode for maximum compatibility, diffusion for throughput-critical workloads, or self-speculation when both speed and deterministic output matter. The 6.4× tokens-per-forward-pass figure and the roughly 4× throughput gain are NVIDIA's own numbers and remain to be independently verified, but if confirmed, this represents a meaningful shift in how inference-optimized models might be designed.