NVIDIA releases Nemotron-Labs-Diffusion-14B with tri-mode decoding achieving 3.3x speed-up on GB200
NVIDIA released Nemotron-Labs-Diffusion-14B, a 14-billion parameter language model that supports three decoding modes by switching attention patterns during inference. The model achieves 850 tokens per second on GB200 hardware at concurrency 1, representing a 3.3x speed-up over standard autoregressive decoding and outperforming Qwen3-8B-Eagle3 by 2.2x in self-speculation mode.
NVIDIA released Nemotron-Labs-Diffusion-14B, a 14-billion parameter language model that switches between autoregressive (AR), diffusion-based parallel decoding, and self-speculation modes by changing attention patterns during inference. According to NVIDIA, the model achieves 850 tokens per second on GB200 hardware at concurrency 1, representing a 3.3x speed-up compared to 253 tok/sec in standard AR mode.
Technical Architecture
The model family includes 3B, 8B, and 14B variants in base, instruct, and vision-language configurations. The architecture enables what NVIDIA calls "self-speculation": the same model performs diffusion-based parallel drafting and AR verification with shared KV cache. This approach shifts generation from memory-bound to compute-bound by loading model weights once and reusing them to compute multiple tokens.
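The draft-then-verify idea behind self-speculation can be sketched with a toy loop. This is a generic greedy speculative-decoding sketch, not NVIDIA's implementation: `draft_block` and `verify_next` are hypothetical stand-ins for the diffusion drafting pass and the AR verification pass, and a real system would verify all drafted positions in a single forward pass rather than one call per token.

```python
from typing import Callable, List


def self_speculate(
    prefix: List[int],
    draft_block: Callable[[List[int], int], List[int]],
    verify_next: Callable[[List[int]], int],
    block_len: int,
    max_new_tokens: int,
) -> List[int]:
    """Toy draft-then-verify loop with greedy acceptance.

    draft_block(context, k) proposes k tokens at once (stand-in for the
    parallel diffusion pass). verify_next(context) returns the token that
    plain AR decoding would emit next (stand-in for AR verification).
    """
    out = list(prefix)
    target_len = len(prefix) + max_new_tokens
    while len(out) < target_len:
        draft = draft_block(out, block_len)
        accepted: List[int] = []
        for tok in draft:
            expected = verify_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the verifier's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every drafted token was accepted: the verifier adds one more.
            accepted.append(verify_next(out + accepted))
        out.extend(accepted)
    return out[:target_len]
```

Because a rejected token is always replaced by the verifier's choice, the output matches what plain greedy AR decoding would produce; a better drafter only changes how many tokens are accepted per round.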
NVIDIA reports that the 8B variant generates 5.9x as many tokens per forward pass as Qwen3-8B without multi-token prediction while maintaining the same accuracy. In self-speculation mode, NVIDIA claims 3x higher acceptance length and a 2.2x speed-up versus Qwen3-8B-Eagle3 in SGLang.
Performance Benchmarks
On DGX Spark hardware (8B model, concurrency 1), the model achieves 112 tok/sec using w4a16 quantization, a 2.7x speed-up over AR's 41.8 tok/sec. On GB200, the 8B model reaches 850 tok/sec in self-speculation mode versus 360 tok/sec with Eagle3. Custom CUDA kernels push performance to 1,015 tok/sec, a 4x improvement over baseline AR.
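The speed-up ratios can be rechecked directly from the quoted throughput figures. The short sketch below does only that arithmetic; the 253 tok/sec AR baseline on GB200 is the figure quoted earlier in the article, and the dictionary labels are my own.

```python
# Throughput figures in tokens/sec as quoted in the article: (faster, baseline).
reported = {
    "DGX Spark w4a16 vs AR": (112.0, 41.8),
    "GB200 self-speculation vs AR": (850.0, 253.0),
    "GB200 self-speculation vs Eagle3": (850.0, 360.0),
    "GB200 custom CUDA kernels vs AR": (1015.0, 253.0),
}


def speedup(fast: float, base: float) -> float:
    """Ratio of two throughput numbers."""
    return fast / base


ratios = {name: round(speedup(f, b), 2) for name, (f, b) in reported.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

The computed ratios are about 2.68x, 3.36x, 2.36x, and 4.01x. The Eagle3 ratio comes out somewhat above the 2.2x quoted for SGLang, so those two figures may come from different measurement setups.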
NVIDIA's "speedup-of-light analysis" suggests throughput could double current best performance for single-user scenarios with improved sampling algorithms.
Implementation Details
The model supports three inference modes through simple API calls:
ar_generate() for standard autoregressive decoding
generate() for diffusion mode with configurable block length and threshold
linear_spec_generate() for self-speculation with optional LoRA adapter
An optional LoRA adapter can be applied to the diffusion drafter in linear self-speculation mode to increase acceptance length. The model requires transformers>=5.0.0 and runs in bfloat16 precision.
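One way to pick a decoding mode at run time is a thin dispatcher over the three method names. This is only a sketch: the method names come from the article, but their argument lists are not given there, so arguments are passed through untouched. The mode labels, the `decode` helper, and `StubModel` are hypothetical and exist only so the example runs without downloading weights.

```python
from typing import Any

# Method names are taken from the article; the mode labels are illustrative.
MODE_TO_METHOD = {
    "ar": "ar_generate",
    "diffusion": "generate",
    "self_speculation": "linear_spec_generate",
}


def decode(model: Any, mode: str, *args: Any, **kwargs: Any) -> Any:
    """Call the decoding method that corresponds to `mode`."""
    try:
        method_name = MODE_TO_METHOD[mode]
    except KeyError:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(MODE_TO_METHOD)}"
        ) from None
    # Arguments are forwarded as-is because the real signatures are not
    # documented in the article.
    return getattr(model, method_name)(*args, **kwargs)


class StubModel:
    """Hypothetical stand-in for the real model object."""

    def ar_generate(self, *args: Any, **kwargs: Any) -> str:
        return "ar"

    def generate(self, *args: Any, **kwargs: Any) -> str:
        return "diffusion"

    def linear_spec_generate(self, *args: Any, **kwargs: Any) -> str:
        return "self_speculation"
```

With a real model loaded, the same `decode(model, "self_speculation", ...)` call would forward to `linear_spec_generate()`; consult the model card for the actual arguments each method accepts.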
Availability
The model is available on Hugging Face under the NVIDIA Nemotron Open Model License. The release includes base model weights and a linear_spec LoRA adapter subfolder. NVIDIA provides example code for all three decoding modes with chat template support.
What This Means
This release represents an architectural shift in how language models handle inference efficiency. By enabling multiple decoding strategies within a single model through attention pattern switching, NVIDIA eliminates the need to deploy separate models for different latency-throughput tradeoffs. The self-speculation approach delivers substantial speed gains without external draft models, potentially reducing deployment complexity for organizations operating at varying concurrency levels. However, real-world performance will depend on workload characteristics and whether the compute-bound regime benefits materialize across diverse use cases.