NVIDIA releases Nemotron-Labs-TwoTower-30B: block-wise diffusion model claims 2.42× faster generation at 98.7% of baseline quality

TL;DR

NVIDIA released Nemotron-Labs-TwoTower-30B-A3B-Base-BF16, a block-wise diffusion language model that generates text by denoising blocks of tokens in parallel rather than sequentially. According to NVIDIA, the model achieves 2.42× the wall-clock generation throughput of its autoregressive baseline while retaining 98.7% of aggregate benchmark quality.

Architecture: One Frozen Tower, One Trainable Tower

The model uses a dual-tower architecture built on the NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 backbone:

  • Context tower (AR/context): Frozen causal autoregressive tower that processes the prompt and previously committed tokens
  • Denoiser tower (diffusion/denoiser): Trainable tower that generates blocks of up to 16 tokens at a time via mask diffusion

Each tower has 52 layers combining Mamba-2, self-attention, and MoE components. Total model parameters are ~60B (~30B per tower). Each tower activates ~3B parameters per token, routing to 6 of 128 routable experts plus 2 shared experts.
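The parameter figures above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch using only the approximate numbers quoted in this article. The variable names are illustrative, and treating the 2 shared experts as separate from the 128 routable experts is an assumption.

```python
# Approximate figures quoted in the article (billions of parameters).
PARAMS_PER_TOWER_B = 30.0
ACTIVE_PER_TOWER_B = 3.0
NUM_TOWERS = 2

# Expert counts quoted in the article.
# Assumption: the 2 shared experts are counted separately from the 128 routable ones.
ROUTED_EXPERTS = 128
ROUTED_ACTIVE = 6
SHARED_EXPERTS = 2

total_params_b = PARAMS_PER_TOWER_B * NUM_TOWERS            # both towers together
active_fraction = ACTIVE_PER_TOWER_B / PARAMS_PER_TOWER_B   # share of a tower used per token
experts_per_token = ROUTED_ACTIVE + SHARED_EXPERTS          # experts touched per token
routed_fraction = ROUTED_ACTIVE / ROUTED_EXPERTS            # share of routable experts selected

print(f"total parameters: ~{total_params_b:.0f}B")
print(f"active share per tower: {active_fraction:.0%}")
print(f"experts used per token: {experts_per_token}")
print(f"routable experts selected: {routed_fraction:.1%}")
```

On these numbers, each tower touches roughly a tenth of its weights per token, which is why a ~60B-parameter model can be described in terms of ~3B active parameters per tower.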

The denoiser tower uses bidirectional in-block attention, layer-aligned cross-attention to the context tower, and context-seeded Mamba-2 states. Time conditioning is handled via adaLN-single modulation (PixArt-α style).

Training and Data

Training occurred in two stages:

  1. Backbone pre-training: The single-tower baseline was pre-trained from scratch on ~25T tokens using next-token prediction
  2. Denoiser training: Only the diffusion/denoiser tower was trained (context tower frozen) using a masked-diffusion objective over ~2.1T tokens

Data cutoff: June 25, 2025. Model development: September 2025 – April 2026.

Benchmark Performance

Default configuration: confidence threshold γ = 0.8, block size 16, BF16 on 2×H100 GPUs.

Key results (diffusion vs. AR baseline):

  • MMLU (5-shot): 78.24 vs. 78.56
  • HumanEval (0-shot): 75.58 vs. 79.27
  • GSM8K (8-shot): 90.14 vs. 92.49
  • MATH-500 (4-shot): 80.60 vs. 84.40
  • ARC-Challenge (25-shot): 92.66 vs. 91.72

According to NVIDIA, the model retains 98.7% of aggregate baseline quality while delivering 2.42× throughput. Lowering the confidence threshold increases throughput but reduces quality.
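The per-benchmark gaps can be recomputed from the scores listed above. The sketch below is illustrative only: the unweighted mean is an assumed aggregation, and NVIDIA's 98.7% figure is presumably computed over its own benchmark set, so the two numbers need not match.

```python
# (diffusion score, AR baseline score) as listed in the article.
scores = {
    "MMLU (5-shot)": (78.24, 78.56),
    "HumanEval (0-shot)": (75.58, 79.27),
    "GSM8K (8-shot)": (90.14, 92.49),
    "MATH-500 (4-shot)": (80.60, 84.40),
    "ARC-Challenge (25-shot)": (92.66, 91.72),
}

# Absolute gap in points (negative means the diffusion model scores lower).
point_gaps = {name: round(d - b, 2) for name, (d, b) in scores.items()}

# Fraction of the baseline score that the diffusion model retains.
retained = {name: d / b for name, (d, b) in scores.items()}

# Assumption: a plain unweighted mean over the five benchmarks listed here.
mean_retained = sum(retained.values()) / len(retained)

for name in scores:
    print(f"{name:<24} gap {point_gaps[name]:+.2f}  retained {retained[name]:.1%}")
print(f"unweighted mean over listed benchmarks: {mean_retained:.1%}")
```

The largest gaps fall on HumanEval and MATH-500, while ARC-Challenge is the one listed benchmark where the diffusion model scores above the baseline.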

Generation Modes

Three generation modes are available:

  1. Mask Diffusion: Block-wise iterative denoising (up to block_size tokens per step)
  2. Mock-AR: Two-tower autoregressive mode (1 token per step)
  3. AR: Standard autoregressive using context tower only (1 token per step)

Mask diffusion works by initializing each block as all [MASK] tokens, then iteratively denoising it over multiple steps. At each step, positions whose predicted confidence meets the threshold are committed, and the remaining positions are re-masked and denoised again until the block is complete.
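The commit-and-re-mask loop can be illustrated with a toy simulation. This is a sketch under stated assumptions, not NVIDIA's implementation: `toy_denoiser` is a hypothetical stand-in that returns random tokens and confidences, and the rule of committing the single most confident position when nothing clears the threshold is an assumption added so that every step makes progress.

```python
import random

MASK = None  # stand-in for the [MASK] token


def toy_denoiser(block, rng, vocab_size=100):
    """Hypothetical denoiser: a (token, confidence) guess for each masked position."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, tok in enumerate(block)
        if tok is MASK
    }


def denoise_block(block_size=16, gamma=0.8, seed=0):
    """Fill one block; return the tokens and the number of denoising steps used."""
    rng = random.Random(seed)
    block = [MASK] * block_size  # every position starts masked
    steps = 0
    while any(tok is MASK for tok in block):
        proposals = toy_denoiser(block, rng)
        # Commit every position whose confidence meets the threshold gamma.
        accepted = {i: tok for i, (tok, conf) in proposals.items() if conf >= gamma}
        if not accepted:
            # Assumption: fall back to the most confident position to guarantee progress.
            best = max(proposals, key=lambda i: proposals[i][1])
            accepted = {best: proposals[best][0]}
        for i, tok in accepted.items():
            block[i] = tok
        steps += 1  # positions not accepted stay masked for the next step
    return block, steps


block, steps = denoise_block()
print(f"filled {len(block)} tokens in {steps} denoising steps")
```

In this toy setup a lower threshold commits more positions per step and so finishes a block in fewer steps. Because the confidences are random, the simulation says nothing about output quality.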

Availability

The model is released under the NVIDIA Nemotron Open Model License Agreement and is ready for commercial use. It is available on Hugging Face as nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16.

What This Means

This release is NVIDIA's attempt to accelerate inference through parallel token generation rather than sequential decoding. The 2.42× speedup claim comes with a measured quality trade-off: the 3.7-point drop on HumanEval (75.58 vs. 79.27) and the 3.8-point drop on MATH-500 (80.60 vs. 84.40) indicate meaningful degradation on code and math tasks, despite the "98.7% retained quality" aggregate figure. The architecture's complexity (dual towers, cross-attention, Mamba-2 states) may limit adoption compared to simpler speculative decoding approaches that target comparable speedups. Pricing and availability through NVIDIA's API endpoints have not been disclosed.
