NVIDIA releases Nemotron-Labs-Diffusion-14B with tri-mode decoding achieving 3.3x speed-up on GB200

TL;DR

NVIDIA released Nemotron-Labs-Diffusion-14B, a 14-billion-parameter language model that supports three decoding modes by switching attention patterns during inference. According to NVIDIA, the model reaches 850 tokens per second on GB200 hardware at concurrency 1, a 3.3x speed-up over standard autoregressive decoding, and its self-speculation mode is 2.2x faster than Qwen3-8B-Eagle3.

May 22, 2026 · 6:51 PM

NVIDIA released Nemotron-Labs-Diffusion-14B, a 14-billion-parameter language model that switches between autoregressive (AR), diffusion-based parallel decoding, and self-speculation modes by changing attention patterns during inference. According to NVIDIA, the model achieves 850 tokens per second on GB200 hardware at concurrency 1, a 3.3x speed-up over the 253 tok/sec it reaches in standard AR mode.

Technical Architecture

The model family includes 3B, 8B, and 14B variants in base, instruct, and vision-language configurations. The architecture enables what NVIDIA calls "self-speculation": the same model performs diffusion-based parallel drafting and AR verification with a shared KV cache. This approach shifts generation from memory-bound to compute-bound: model weights are loaded once per forward pass and reused to compute multiple tokens instead of just one.
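The draft-then-verify loop behind self-speculation can be illustrated with a toy simulation. The sketch below is not NVIDIA's implementation: `ar_next`, `draft_block`, `verify`, and `self_speculate` are hypothetical stand-ins, the "model" is a simple arithmetic function, and the drafter is deliberately made wrong at every fourth position so that rejection is exercised. It only shows that accepting the longest verified prefix reproduces the AR output in fewer rounds.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption, for illustration only)

def ar_next(context: List[int]) -> int:
    """Toy stand-in for one AR step: the next token is a function of the context."""
    return (sum(context) * 31 + len(context)) % VOCAB

def draft_block(context: List[int], block_len: int) -> List[int]:
    """Toy stand-in for parallel drafting: proposes block_len tokens at once.
    A deliberate error is injected whenever the context length is 3 mod 4."""
    ctx, out = list(context), []
    for _ in range(block_len):
        tok = ar_next(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % VOCAB  # simulated drafting mistake
        out.append(tok)
        ctx.append(tok)
    return out

def verify(context: List[int], draft: List[int]) -> List[int]:
    """Accept the longest draft prefix that matches AR, then add AR's own token
    at the first mismatch. A real system would do this in one forward pass;
    the Python loop here only simulates that check."""
    ctx, accepted = list(context), []
    for tok in draft:
        expected = ar_next(ctx)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

def self_speculate(prompt: List[int], max_new: int, block_len: int = 4) -> Tuple[List[int], int]:
    """Run draft/verify rounds until max_new tokens exist; return tokens and round count."""
    tokens, rounds = list(prompt), 0
    while len(tokens) - len(prompt) < max_new:
        tokens.extend(verify(tokens, draft_block(tokens, block_len)))
        rounds += 1
    return tokens[: len(prompt) + max_new], rounds

def ar_decode(prompt: List[int], max_new: int) -> List[int]:
    """Plain AR baseline: one token per step."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(ar_next(tokens))
    return tokens

if __name__ == "__main__":
    spec_tokens, rounds = self_speculate([1, 2, 3], max_new=16)
    print(spec_tokens == ar_decode([1, 2, 3], 16), rounds)
```

In this toy setup the speculative output is identical to the AR output by construction, because every kept token is checked against `ar_next`. The number of rounds depends entirely on how often the drafter is right, which is what the acceptance-length figures describe.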

According to NVIDIA, the 8B variant generates 5.9x as many tokens per forward pass as Qwen3-8B without multi-token prediction, while maintaining the same accuracy. In self-speculation mode, NVIDIA claims a 3x longer acceptance length and a 2.2x speed-up versus Qwen3-8B-Eagle3 in SGLang.
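Why more tokens per forward pass translates into throughput can be sketched with a simple roofline-style estimate. This is an illustrative model, not NVIDIA's analysis: the function and parameter names are hypothetical, the numbers in the usage line are unit-less placeholders rather than GB200 or DGX Spark specifications, and the model ignores rejected draft tokens, KV-cache traffic, and kernel overheads.

```python
def pass_time_s(weight_bytes: float, bandwidth_bytes_per_s: float,
                flops_per_token: float, tokens_per_pass: int,
                peak_flops: float) -> float:
    """Time for one forward pass: whichever of memory traffic or compute dominates."""
    memory_time = weight_bytes / bandwidth_bytes_per_s          # load weights once
    compute_time = flops_per_token * tokens_per_pass / peak_flops
    return max(memory_time, compute_time)

def tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float,
                      flops_per_token: float, tokens_per_pass: int,
                      peak_flops: float) -> float:
    """Throughput if every token computed in the pass is kept (an optimistic assumption)."""
    t = pass_time_s(weight_bytes, bandwidth_bytes_per_s,
                    flops_per_token, tokens_per_pass, peak_flops)
    return tokens_per_pass / t

if __name__ == "__main__":
    # Placeholder values chosen so the crossover is easy to see, not real hardware numbers.
    for n in (1, 5, 10, 20):
        print(n, tokens_per_second(100.0, 100.0, 1.0, n, 10.0))
```

With these placeholders, throughput grows linearly with tokens per pass while the pass is memory-bound, then flattens once compute time exceeds the time needed to load the weights.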

Performance Benchmarks

On DGX Spark hardware (8B model, concurrency 1), the model achieves 112 tok/sec using w4a16 quantization, a 2.7x speed-up over the 41.8 tok/sec of AR decoding. On GB200, the 8B model reaches 850 tok/sec in self-speculation mode versus 360 tok/sec with Eagle3. Custom CUDA kernels push performance to 1,015 tok/sec, a 4x improvement over baseline AR.
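The speed-up figures quoted in this article can be checked directly from the reported throughput numbers. The short script below uses only values stated in the text; the `speedup` helper is a hypothetical name introduced for illustration.

```python
def speedup(fast_tok_per_s: float, slow_tok_per_s: float) -> float:
    """Ratio of two throughput figures measured in tokens per second."""
    return fast_tok_per_s / slow_tok_per_s

if __name__ == "__main__":
    print(round(speedup(850, 253), 2))    # GB200 self-speculation vs. AR
    print(round(speedup(1015, 253), 2))   # GB200 custom CUDA kernels vs. AR
    print(round(speedup(112, 41.8), 2))   # DGX Spark w4a16 vs. AR
    print(round(speedup(850, 360), 2))    # GB200 self-speculation vs. Eagle3
```

The first three ratios come out at about 3.36, 4.01, and 2.68, in line with the quoted 3.3x, 4x, and 2.7x. The GB200 comparison against Eagle3 works out to about 2.36, slightly above the 2.2x figure cited for SGLang; the source does not explain the difference.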

NVIDIA's "speedup-of-light analysis" suggests throughput could double current best performance for single-user scenarios with improved sampling algorithms.

Implementation Details

The model supports three inference modes through simple API calls:

ar_generate() for standard autoregressive decoding
generate() for diffusion mode with configurable block length and threshold
linear_spec_generate() for self-speculation with optional LoRA adapter
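One way an application might switch between the three entry points is a small dispatcher keyed by mode. This is a sketch under assumptions: only the method names `ar_generate`, `generate`, and `linear_spec_generate` come from the release; the mode labels, the `run_decoding` helper, and the `FakeModel` stub are hypothetical, and the real methods' arguments are not shown here because the article does not give their signatures.

```python
# Mode labels on the left are hypothetical; method names on the right are from the release.
MODE_TO_METHOD = {
    "ar": "ar_generate",
    "diffusion": "generate",
    "self_speculation": "linear_spec_generate",
}

def run_decoding(model, mode: str, *args, **kwargs):
    """Look up the decoding method for `mode` and forward all arguments unchanged."""
    if mode not in MODE_TO_METHOD:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_TO_METHOD)}")
    return getattr(model, MODE_TO_METHOD[mode])(*args, **kwargs)

class FakeModel:
    """Stub used only to exercise the dispatcher; it is not the real model class."""
    def ar_generate(self, prompt):
        return ("ar_generate", prompt)

    def generate(self, prompt):
        return ("generate", prompt)

    def linear_spec_generate(self, prompt):
        return ("linear_spec_generate", prompt)

if __name__ == "__main__":
    print(run_decoding(FakeModel(), "self_speculation", "hello"))
```

Forwarding `*args` and `**kwargs` untouched keeps the helper independent of whatever parameters each mode actually accepts.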

The optional LoRA adapter is applied to the diffusion drafter in linear self-speculation mode to increase acceptance length. The model requires transformers>=5.0.0 and runs in bfloat16 precision.
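A quick way to confirm the transformers>=5.0.0 requirement before loading the model is to check the installed package version. The sketch below uses only the Python standard library; the helper names are hypothetical, and the parser is deliberately simple: it compares the leading numeric release only, so a pre-release such as 5.0.0rc1 is treated as 5.0.0.

```python
import re
from importlib import metadata

def parse_release(version: str) -> tuple:
    """Extract the leading numeric release, padded to three parts: '5.1' -> (5, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version {version!r}")
    release = tuple(int(part) for part in match.group().split("."))
    return release + (0,) * (3 - len(release))

def meets_minimum(installed: str, minimum: str = "5.0.0") -> bool:
    """True if the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)

def transformers_ok(minimum: str = "5.0.0") -> bool:
    """Check the installed transformers package; False if it is not installed."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)

if __name__ == "__main__":
    print("transformers requirement met:", transformers_ok())
```

For stricter handling of pre-release and development versions, a dedicated version-parsing library would be a better fit than this regular expression.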

Availability

The model is available on Hugging Face under the NVIDIA Nemotron Open Model License. The release includes base model weights and a linear_spec LoRA adapter subfolder. NVIDIA provides example code for all three decoding modes with chat template support.

What This Means

This release represents an architectural shift in how language models handle inference efficiency. By enabling multiple decoding strategies within a single model through attention pattern switching, NVIDIA removes the need to deploy separate models for different latency-throughput tradeoffs. The self-speculation approach delivers substantial speed gains without external draft models, potentially reducing deployment complexity for organizations operating at varying concurrency levels. However, real-world performance will depend on workload characteristics and on whether the benefits of the compute-bound regime materialize across diverse use cases.

Source: huggingface.co
