NVIDIA releases Nemotron-Labs-Diffusion-14B with tri-mode decoding achieving 3.3x speed-up on GB200

TL;DR

NVIDIA released Nemotron-Labs-Diffusion-14B, a 14-billion-parameter language model that supports three decoding modes by switching attention patterns during inference. NVIDIA reports 850 tokens per second on GB200 hardware at concurrency 1, a 3.3x speed-up over standard autoregressive decoding, and a 2.2x speed-up over Qwen3-8B-Eagle3 in self-speculation mode.

NVIDIA released Nemotron-Labs-Diffusion-14B, a 14-billion-parameter language model that switches between autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation by changing attention patterns during inference. According to NVIDIA, the model achieves 850 tok/sec on GB200 hardware at concurrency 1, a 3.3x speed-up over the 253 tok/sec of standard AR mode.

Technical Architecture

The model family includes 3B, 8B, and 14B variants in base, instruct, and vision-language configurations. The architecture enables what NVIDIA calls "self-speculation": the same model performs diffusion-based parallel drafting and AR verification with a shared KV cache. This approach shifts generation from memory-bound to compute-bound by loading the model weights once and reusing them to compute multiple tokens per pass.
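The draft-then-verify idea can be illustrated with a toy loop. This is a minimal sketch of generic greedy speculative decoding, not NVIDIA's implementation: the function names (self_speculate, toy_draft, toy_verify) and the integer "tokens" are hypothetical, the bonus token after a fully accepted draft is an assumption borrowed from common speculative decoding schemes, and a real system would verify the whole draft in one batched forward pass rather than token by token.

```python
from typing import Callable, List, Tuple


def self_speculate(
    draft_block: Callable[[List[int], int], List[int]],
    verify_next: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
    block_len: int = 4,
) -> Tuple[List[int], int]:
    """Draft block_len tokens, keep the prefix the verifier agrees with.

    Returns the generated tokens and the number of draft/verify rounds.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        rounds += 1
        draft = draft_block(out, block_len)
        accepted: List[int] = []
        for tok in draft:
            # Called per token here for clarity; conceptually this is one
            # verification pass over the whole drafted block.
            target = verify_next(out + accepted)
            if tok == target:
                accepted.append(tok)
            else:
                # First mismatch: take the verifier's token and stop.
                accepted.append(target)
                break
        else:
            # Whole draft accepted: the verifier also supplies one extra token.
            accepted.append(verify_next(out + accepted))
        out.extend(accepted)
    generated = out[len(prompt):len(prompt) + max_new_tokens]
    return generated, rounds


def toy_verify(ctx: List[int]) -> int:
    """Stand-in for AR verification: the 'true' next token counts up mod 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int], k: int) -> List[int]:
    """Stand-in for parallel drafting: right except for the last guess."""
    toks = [(ctx[-1] + i + 1) % 10 for i in range(k)]
    toks[-1] = (toks[-1] + 5) % 10  # deliberately wrong final guess
    return toks


if __name__ == "__main__":
    tokens, rounds = self_speculate(toy_draft, toy_verify, [0], 8, block_len=4)
    print(tokens, rounds)
```

In this toy setup each round yields four tokens (three accepted draft tokens plus the verifier's correction), so eight tokens take two rounds instead of eight sequential AR steps. The output is identical to plain AR decoding with the verifier, which is the property that lets speculation trade extra compute for fewer weight loads.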

According to NVIDIA, the 8B variant generates 5.9x as many tokens per forward pass as Qwen3-8B without multi-token prediction, at the same accuracy. In self-speculation mode, NVIDIA claims a 3x higher acceptance length and a 2.2x speed-up versus Qwen3-8B-Eagle3 in SGLang.

Performance Benchmarks

On DGX Spark hardware (8B model, concurrency 1), the model achieves 112 tok/sec using w4a16 quantization, a 2.7x speed-up over AR's 41.8 tok/sec. On GB200, the 8B model reaches 850 tok/sec in self-speculation mode versus 360 tok/sec with Eagle3. Custom CUDA kernels push performance to 1,015 tok/sec, a 4x improvement over baseline AR.
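The quoted speed-ups can be sanity-checked against the raw throughput figures. The sketch below only divides numbers that appear in this article; the dictionary keys and the speedup helper are illustrative names, and the pairing of the 1,015 tok/sec figure with the 253 tok/sec AR baseline is an assumption based on the figures quoted above.

```python
# Throughput figures in tokens/sec, as quoted in the article.
REPORTED = {
    "gb200_ar": 253.0,
    "gb200_self_spec": 850.0,
    "gb200_custom_kernels": 1015.0,
    "gb200_eagle3": 360.0,
    "spark_ar": 41.8,
    "spark_w4a16": 112.0,
}


def speedup(fast_key: str, base_key: str, figures: dict = REPORTED) -> float:
    """Ratio of two reported throughputs."""
    return figures[fast_key] / figures[base_key]


if __name__ == "__main__":
    pairs = [
        ("gb200_self_spec", "gb200_ar"),
        ("gb200_custom_kernels", "gb200_ar"),
        ("spark_w4a16", "spark_ar"),
        ("gb200_self_spec", "gb200_eagle3"),
    ]
    for fast, base in pairs:
        print(f"{fast} vs {base}: {speedup(fast, base):.2f}x")
```

The first three ratios come out at roughly 3.4x, 4.0x, and 2.7x, in line with the quoted 3.3x, 4x, and 2.7x. The 850 versus 360 tok/sec pair gives about 2.4x, slightly above the 2.2x cited for Eagle3; the article does not say whether the two figures come from the same measurement setup.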

NVIDIA's "speedup-of-light analysis" suggests throughput could double current best performance for single-user scenarios with improved sampling algorithms.

Implementation Details

The model supports three inference modes through simple API calls:

  • ar_generate() for standard autoregressive decoding
  • generate() for diffusion mode with configurable block length and threshold
  • linear_spec_generate() for self-speculation with optional LoRA adapter

In linear self-speculation mode, the LoRA adapter is applied to the diffusion drafter to increase acceptance length. The model requires transformers>=5.0.0 and runs in bfloat16 precision.
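One way an application might switch between the three modes is a thin dispatcher keyed on a mode name. This is a sketch under stated assumptions: only the three method names come from the article, while decode, MODE_TO_METHOD, and the stub class are hypothetical, and the arguments each method actually accepts should be taken from the model card rather than from this example.

```python
from typing import Any, Dict

# Method names as listed in the article; the mode labels are illustrative.
MODE_TO_METHOD: Dict[str, str] = {
    "ar": "ar_generate",
    "diffusion": "generate",
    "self_spec": "linear_spec_generate",
}


def decode(model: Any, mode: str, *args: Any, **kwargs: Any) -> Any:
    """Forward a call to the model method that matches the requested mode.

    Arguments are passed through unchanged, so this helper makes no
    assumption about the real signatures.
    """
    if mode not in MODE_TO_METHOD:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_TO_METHOD)}")
    method = getattr(model, MODE_TO_METHOD[mode], None)
    if not callable(method):
        raise AttributeError(f"model has no callable {MODE_TO_METHOD[mode]!r}")
    return method(*args, **kwargs)


class StubModel:
    """Hypothetical stand-in so the dispatcher can run without real weights."""

    def ar_generate(self, prompt: str) -> str:
        return f"ar:{prompt}"

    def generate(self, prompt: str) -> str:
        return f"diffusion:{prompt}"

    def linear_spec_generate(self, prompt: str) -> str:
        return f"self_spec:{prompt}"


if __name__ == "__main__":
    stub = StubModel()
    for mode in MODE_TO_METHOD:
        print(decode(stub, mode, "hello"))
```

Keeping the mode choice in one place would let a serving layer pick a decoding strategy per request, for example by latency budget, without touching the calling code.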

Availability

The model is available on Hugging Face under the NVIDIA Nemotron Open Model License. The release includes base model weights and a linear_spec LoRA adapter subfolder. NVIDIA provides example code for all three decoding modes with chat template support.

What This Means

This release represents an architectural shift in how language models handle inference efficiency. By enabling multiple decoding strategies within a single model through attention pattern switching, NVIDIA eliminates the need to deploy separate models for different latency-throughput tradeoffs. The self-speculation approach delivers substantial speed gains without external draft models, potentially reducing deployment complexity for organizations operating at varying concurrency levels. However, real-world performance will depend on workload characteristics and on whether the benefits of the compute-bound regime materialize across diverse use cases.