NVIDIA releases Nemotron-3-Ultra: 550B parameter model with 1M token context and configurable reasoning
NVIDIA released Nemotron-3-Ultra-550B, a frontier-scale model with 550B total parameters (55B active) and up to 1M token context window. The model uses a hybrid LatentMoE architecture combining Mamba-2, MoE, and attention layers with Multi-Token Prediction, trained with NVFP4 quantization-aware methods from December 2025 to April 2026.
Nemotron-3-Ultra-550B-A55B — Quick Specs
NVIDIA Releases Nemotron-3-Ultra: 550B Parameter Model with 1M Token Context
NVIDIA released Nemotron-3-Ultra-550B-A55B-BF16 on June 4, 2026, a frontier-scale language model with 550B total parameters and 55B active parameters. The model supports context windows up to 1M tokens and features configurable reasoning capabilities.
Architecture and Training
The model employs a hybrid LatentMoE (Latent Mixture-of-Experts) architecture that combines Mamba-2 layers, MoE layers, and attention layers. It incorporates Multi-Token Prediction (MTP) layers designed to accelerate text generation and improve output quality.
NVIDIA trained the model using an NVFP4 quantization-aware pre-training recipe from December 2025 to April 2026. Pre-training data has a cutoff date of September 2025, while post-training data extends to May 2026. The model was trained on approximately 20T tokens across code, math, science, and general knowledge datasets.
Hardware and Deployment
Minimum deployment requirements are substantial: 8x GB200/B200/GB300/B300 GPUs, 16x H100 GPUs, or 8x H200 GPUs. NVIDIA also released a quantized NVFP4 version (NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4) for reduced memory footprint.
Benchmark Performance
According to NVIDIA, the model achieves competitive scores across multiple benchmarks:
- Agentic tasks: 56.4 on Terminal Bench 2.1, 71.9 on SWE-Bench Verified, 67.7 on SWE-Bench Multilingual
- Reasoning: 89.0 on LiveCodeBench v6, 88.6 on IMOAnswerBench (no tools), 86.8 on MMLU-Pro
- Long context: 94.7 on RULER (1M), 61.9 on Longbench v2 (≤1M)
- Code: 570.0 on IOI 2025
The model trails DeepSeek-v4-Pro and several other frontier models on benchmarks like Terminal Bench 2.1 (67.2 for Kimi-K2.6 vs 56.4) and GDPVal (54.7 for GLM-5.1 vs 46.7).
Key Features
The model supports 11 languages: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, and Chinese. Reasoning mode can be toggled via the chat template using enable_thinking=True/False.
NVIDIA released the model under the OpenMDW License Agreement version 1.1, allowing both commercial and non-commercial use. The company states the model is optimized for "complex agentic workflows, long-context analysis, and high-accuracy reasoning over code, math, and science."
What This Means
Nemotron-3-Ultra represents NVIDIA's entry into the ultra-large model space with a distinctive hybrid architecture that prioritizes efficiency through sparse activation (55B of 550B parameters active per token) and quantization-aware training. The 1M token context window positions it competitively for long-document analysis, though benchmark results show it trailing specialized models like DeepSeek-v4-Pro on several agentic and reasoning tasks. The substantial hardware requirements (minimum 8x H200 or 16x H100) limit deployment to well-resourced organizations, though the NVFP4 quantized version may broaden accessibility. The configurable reasoning mode offers flexibility for applications where step-by-step thinking traces are either required or need to be minimized for latency.
Related Articles
Nvidia Releases Nemotron 3 Ultra: 550B Parameter MoE Model with 1M Token Context Window
Nvidia has released Nemotron 3 Ultra, a 550B parameter mixture-of-experts model with 55B active parameters and a 1M token context window. The model uses a hybrid Transformer-Mamba architecture and is available for free through OpenRouter, targeting agentic workflows and multi-step reasoning tasks.
NVIDIA Releases Nemotron 3.5 ASR: 600M-Parameter Streaming Speech Model for 40 Languages
NVIDIA released Nemotron 3.5 ASR, a 600M-parameter speech-to-text model supporting 40 language-locales from a single checkpoint. The model achieves 0.07 seconds to final transcript after speech ends and ranks 2nd in latency among streaming ASR models according to Artificial Analysis benchmarks.
NVIDIA Nemotron 3 Ultra launches on AWS SageMaker with 550B parameters, 1M token context window
NVIDIA Nemotron 3 Ultra is now available on Amazon SageMaker JumpStart with 550 billion total parameters and 55 billion active parameters. The model features a hybrid Transformer-Mamba Mixture-of-Experts architecture and supports context windows up to 1 million tokens, targeting agentic AI workloads.
Nvidia Releases Free 4B-Parameter Nemotron 3.5 Content Safety Model with 128K Context
Nvidia has released Nemotron 3.5 Content Safety, a 4-billion parameter multimodal guardrail model fine-tuned from Google Gemma-3-4B. The model is available for free, supports 128K token context windows, and moderates content across 12 languages.
Comments
Loading...