NVIDIA Releases Nemotron-3-Ultra: 550B Parameter Model with 1M Token Context and Configurable Reasoning

TL;DR

NVIDIA released Nemotron-3-Ultra-550B-A55B-NVFP4, a 550B parameter model with 55B active parameters, featuring a 1M token context window and configurable reasoning mode. The model uses a hybrid LatentMoE architecture combining Mamba-2, Mixture-of-Experts, and Attention layers with Multi-Token Prediction, trained with NVIDIA's NVFP4 quantization-aware approach.

Nemotron 3 Ultra: Quick Specs

  • Context window: 1000K tokens
  • Input: $0.5/1M tokens
  • Output: $2.5/1M tokens

NVIDIA released Nemotron-3-Ultra-550B-A55B-NVFP4 on June 4, 2026. It is a frontier-scale language model with 550B total parameters, of which 55B are active. The model supports context windows of up to 1M tokens and includes a configurable reasoning mode that can be toggled via the chat template.

Architecture and Technical Specifications

The model employs a hybrid LatentMoE (Latent Mixture-of-Experts) architecture that combines Mamba-2 state-space layers, MoE layers, and select Attention layers. According to NVIDIA, this architecture projects tokens into a smaller latent dimension for expert routing, improving "accuracy per byte."
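
The routing idea described above can be illustrated with a toy sketch. This is not NVIDIA's implementation: the dimensions, the random weights, the top-k softmax router, and the helper names (`matvec`, `softmax`, `route_in_latent_space`) are all assumptions chosen only to show a router that scores experts from a down-projected token rather than from the full hidden state.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_in_latent_space(token, down_proj, router, top_k):
    """Return [(expert_index, weight), ...] for a single token.

    The token is first projected from d_model to a smaller d_latent,
    and the router scores experts from that latent vector only.
    """
    latent = matvec(down_proj, token)        # d_model -> d_latent
    probs = softmax(matvec(router, latent))  # one probability per expert
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)     # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]


# Illustrative sizes only (hypothetical, far smaller than any real model).
D_MODEL, D_LATENT, N_EXPERTS, TOP_K = 16, 4, 8, 2
rng = random.Random(0)
down_proj = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_LATENT)]
router = [[rng.gauss(0, 1) for _ in range(D_LATENT)] for _ in range(N_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(D_MODEL)]

assignment = route_in_latent_space(token, down_proj, router, TOP_K)
for expert, weight in assignment:
    print(f"expert {expert}: weight {weight:.3f}")
```

In this sketch the router matrix has `N_EXPERTS x D_LATENT` entries instead of `N_EXPERTS x D_MODEL`, which is one way a smaller latent dimension can reduce the bytes spent on routing.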

Key specifications:

  • Total parameters: 550B (55B active)
  • Context length: Up to 1M tokens
  • Architecture: Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP)
  • Training approach: NVFP4 quantization-aware pre-training
  • Minimum hardware: 4x GB200, 4x B200, 4x GB300, 4x B300, or 8x H100 GPUs
  • Supported languages: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, Chinese
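
A rough back-of-the-envelope calculation shows how weight precision relates to the parameter counts listed above. It assumes every weight is stored at the stated precision, and it assumes a block-scaled 4-bit layout with one 8-bit scale per 16 values. It ignores activations, KV cache, and Mamba state, so it is a lower bound on weight storage and not a deployment guide.

```python
TOTAL_PARAMS = 550e9   # total parameters, from the spec list
ACTIVE_PARAMS = 55e9   # active parameters per token, from the spec list


def weight_gigabytes(num_params, bits_per_param):
    """Storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_gigabytes(TOTAL_PARAMS, 16)
fp4_gb = weight_gigabytes(TOTAL_PARAMS, 4)
# Assumption: one 8-bit scale per 16-value block adds 8 / 16 = 0.5 bit per weight.
fp4_scaled_gb = weight_gigabytes(TOTAL_PARAMS, 4 + 8 / 16)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"BF16 weights:            {bf16_gb:.1f} GB")
print(f"4-bit weights:           {fp4_gb:.1f} GB")
print(f"4-bit weights + scales:  {fp4_scaled_gb:.1f} GB")
print(f"Active fraction per token: {active_fraction:.0%}")
```

Under these assumptions the 4-bit weights take roughly a quarter of the BF16 footprint, and only about a tenth of the parameters participate in any single token's forward pass.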

The model incorporates Multi-Token Prediction layers using a shared-weight design across prediction heads, which NVIDIA claims enables faster inference via native speculative decoding.
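
To make the speculative decoding claim concrete, here is a minimal greedy draft-and-verify loop. It is a toy under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins for the main model and its MTP heads, tokens are small integers, and verification is written as a loop even though a real system would check all drafted positions in one batched forward pass.

```python
def target_next(context):
    """Hypothetical stand-in for the full model's greedy next token."""
    return (context[-1] * 3 + len(context)) % 7


def draft_next(context):
    """Hypothetical cheap drafter: agrees with the target except at some positions."""
    token = target_next(context)
    return token if len(context) % 4 else (token + 1) % 7


def speculative_step(context, k):
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    draft, ctx = [], list(context)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    accepted, ctx = [], list(context)
    for token in draft:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # all drafts accepted: one bonus token
    return accepted


def generate(prompt, n_tokens, k):
    """Generate n_tokens and count how many verify passes were needed."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, k))
        passes += 1
    return out[: len(prompt) + n_tokens], passes


def greedy(prompt, n_tokens):
    """Plain one-token-at-a-time decoding with the target model."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out


tokens, passes = generate([1, 2], n_tokens=12, k=4)
print(tokens == greedy([1, 2], 12), passes)
```

The output matches plain greedy decoding token for token, while the number of verify passes is smaller than the number of generated tokens. That gap is where the speedup would come from when drafting is cheap.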

Training Methodology

NVIDIA trained the model through four stages:

  1. Pre-training: 20T tokens using crawled and synthetic data across code, math, science, and general knowledge (cutoff: September 2025)
  2. Supervised fine-tuning: Synthetic data for code, math, science, tool calling, and instruction following
  3. Reinforcement learning: Asynchronous GRPO (Group Relative Policy Optimization) across multiple environments
  4. Multi-Domain On-Policy Distillation: Further refinement across domain-specific tasks
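
The "group relative" part of GRPO in stage 3 can be sketched in a few lines. This shows only the advantage computation for one prompt's group of sampled responses, using the population standard deviation (implementations differ on this detail). The asynchronous scheduling, the policy-gradient loss, and the environments NVIDIA used are not shown, and the reward values are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses to the same prompt.

    advantage_i = (reward_i - group mean) / (group std + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to a single prompt,
# e.g. 1.0 if a unit test passed and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because the baseline is the group's own mean, no separate value model is needed. Responses that beat their siblings get a positive advantage, and a group where every response scores the same contributes no learning signal.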

Post-training data has a cutoff date of May 2026. Training occurred between December 2025 and April 2026 using Megatron-LM and NeMo RL software.

Benchmark Performance

NVIDIA provided benchmark scores comparing BF16 and NVFP4 versions:

Agentic benchmarks:

  • Terminal Bench 2.1: 56.4 (BF16), 53.9 (NVFP4)
  • SWE-Bench Verified: 71.9 (BF16), 69.7 (NVFP4)
  • PinchBench: 90.0 (BF16), 89.8 (NVFP4)

Reasoning benchmarks:

  • GPQA (no tools): 87.0 (BF16), 87.9 (NVFP4)
  • IOI 2025: 570.0 (BF16), 564.7 (NVFP4)
  • OmniScience Accuracy: 24.1 (BF16), 24.6 (NVFP4)

Long context:

  • RULER 1M: 94.7 (BF16), 94.0 (NVFP4)
  • AA-LCR: 65.4 (BF16), 65.5 (NVFP4)

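The NVFP4 version shows minimal degradation relative to BF16. The largest drops are 2.5 points on Terminal Bench 2.1, 2.2 points on SWE-Bench Verified, and 5.3 points on IOI 2025 (570.0 to 564.7, under 1%), while GPQA, OmniScience Accuracy, and AA-LCR come out marginally higher.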
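
The gap between the two versions can be checked directly from the scores listed above. The snippet below only restates the published numbers and computes absolute and relative differences; the variable names are illustrative.

```python
# (BF16, NVFP4) scores as reported in the benchmark lists above.
SCORES = {
    "Terminal Bench 2.1": (56.4, 53.9),
    "SWE-Bench Verified": (71.9, 69.7),
    "PinchBench": (90.0, 89.8),
    "GPQA (no tools)": (87.0, 87.9),
    "IOI 2025": (570.0, 564.7),
    "OmniScience Accuracy": (24.1, 24.6),
    "RULER 1M": (94.7, 94.0),
    "AA-LCR": (65.4, 65.5),
}

# Absolute change in points and relative change in percent (NVFP4 minus BF16).
deltas = {name: round(fp4 - bf16, 1) for name, (bf16, fp4) in SCORES.items()}
relative = {name: (fp4 - bf16) / bf16 * 100 for name, (bf16, fp4) in SCORES.items()}

improved = [name for name, delta in deltas.items() if delta > 0]
worst = min(relative, key=relative.get)

for name in SCORES:
    print(f"{name:22s} {deltas[name]:+6.1f} pts  {relative[name]:+6.2f}%")
print("Higher under NVFP4:", improved)
print("Largest relative drop:", worst)
```

Relative change is included because IOI 2025 is scored on a different scale from the other benchmarks, so its raw point difference is not directly comparable.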

Availability and Licensing

The model is available on Hugging Face under the OpenMDW-1.1 license for both commercial and non-commercial use. Pricing for API access has not yet been disclosed.

What This Means

Nemotron-3-Ultra represents NVIDIA's entry into the 500B+ parameter frontier model space, with a distinctive hybrid architecture that moves beyond pure transformer designs. The 1M token context window and configurable reasoning mode position it for long-document analysis and complex agentic workflows. The NVFP4 quantization approach appears effective at maintaining performance while reducing compute requirements: the small benchmark gap between the BF16 and NVFP4 versions suggests this training methodology could influence future large-scale model development. The model's hardware requirements (minimum 4x GB200 or 8x H100) make it accessible primarily to well-resourced organizations, though the open weights under OpenMDW-1.1 enable self-hosting for those with appropriate infrastructure.
