
NVIDIA Releases Nemotron-3-Ultra: 550B Parameter Model with 1M Token Context and Configurable Reasoning

TL;DR

NVIDIA released Nemotron-3-Ultra-550B-A55B-NVFP4, a 550B parameter model with 55B active parameters, featuring a 1M token context window and configurable reasoning mode. The model uses a hybrid LatentMoE architecture combining Mamba-2, Mixture-of-Experts, and Attention layers with Multi-Token Prediction, trained with NVIDIA's NVFP4 quantization-aware approach.

June 5, 2026 · 2:06 PM · 2 min read

Nemotron 3 Ultra — Quick Specs

Context window: 1M tokens

Input: $0.50/1M tokens

Output: $2.50/1M tokens



NVIDIA released Nemotron-3-Ultra-550B-A55B-NVFP4 on June 4, 2026, a frontier-scale language model with 550B total parameters and 55B active parameters. The model supports up to 1M token context windows and includes a configurable reasoning mode that can be toggled via the chat template.

Architecture and Technical Specifications

The model employs a hybrid LatentMoE (Latent Mixture-of-Experts) architecture that combines Mamba-2 state-space layers, MoE layers, and select Attention layers. According to NVIDIA, this architecture projects tokens into a smaller latent dimension for expert routing, improving "accuracy per byte."
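The routing idea can be sketched in a few lines of plain Python. This is a minimal illustration, not NVIDIA's implementation: the toy dimensions, the random weights, the top-k selection, and the function name `route_in_latent_space` are all assumptions made for the example. It shows only the step the description names, which is projecting a token down to a smaller latent vector and scoring experts there.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_in_latent_space(token, down_proj, router, top_k):
    """Pick top_k experts for one token using a low-dimensional projection.

    Returns (latent, [(expert_index, gate_weight), ...]).
    """
    latent = matvec(down_proj, token)   # hidden_dim -> latent_dim
    logits = matvec(router, latent)     # latent_dim -> num_experts
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])  # renormalise over chosen experts
    return latent, list(zip(chosen, gates))


# Hypothetical toy sizes, chosen only for illustration.
HIDDEN_DIM, LATENT_DIM, NUM_EXPERTS, TOP_K = 16, 4, 8, 2
rng = random.Random(0)
down_proj = [[rng.gauss(0, 1) for _ in range(HIDDEN_DIM)] for _ in range(LATENT_DIM)]
router = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(NUM_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(HIDDEN_DIM)]

latent, assignments = route_in_latent_space(token, down_proj, router, TOP_K)
print(len(latent), assignments)
```

In this sketch the router matrix has NUM_EXPERTS x LATENT_DIM entries rather than NUM_EXPERTS x HIDDEN_DIM, which is one way a smaller latent dimension can reduce the bytes spent on routing.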

Key specifications:

Total parameters: 550B (55B active)
Context length: Up to 1M tokens
Architecture: Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP)
Training approach: NVFP4 quantization-aware pre-training
Minimum hardware: 4x GB200, 4x B200, 4x GB300, 4x B300, or 8x H100 GPUs
Supported languages: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, Chinese
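
As a rough back-of-the-envelope check on the parameter counts above, the weight storage can be estimated from bits per parameter. This is a sketch under loud assumptions: it treats every parameter as stored at exactly 16 bits (BF16) or 4 bits (NVFP4), and it ignores quantization scale factors, any layers kept in higher precision, activations, and cache memory. The helper name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 550e9   # from the spec list
ACTIVE_PARAMS = 55e9   # from the spec list

bf16_gb = weight_footprint_gb(TOTAL_PARAMS, 16)
# Assumption: a flat 4 bits per weight, with scale-factor overhead ignored.
nvfp4_gb = weight_footprint_gb(TOTAL_PARAMS, 4)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"BF16 weights:  {bf16_gb:.0f} GB")
print(f"NVFP4 weights: {nvfp4_gb:.0f} GB")
print(f"Active per token: {active_fraction:.0%} of parameters")
```

Under these assumptions the weights alone come to about 1,100 GB in BF16 and 275 GB at 4 bits, with 10% of parameters active per token.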

The model incorporates Multi-Token Prediction layers using a shared-weight design across prediction heads, which NVIDIA claims enables faster inference via native speculative decoding.
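
The source does not describe how the MTP heads are used at inference time, so the following is only a generic sketch of greedy draft-and-verify speculative decoding. The `draft_next` and `target_next` callables are hypothetical stand-ins: in an MTP setup the draft role would be played by the extra prediction heads, and the toy lambdas below exist purely to make the example runnable.

```python
def speculative_step(prefix, draft_next, target_next, num_draft):
    """One draft-and-verify step with greedy acceptance.

    draft_next and target_next map a token list to the next token.
    Returns the tokens appended to prefix in this step.
    """
    # 1. Draft: cheaply propose num_draft tokens.
    drafted = []
    for _ in range(num_draft):
        drafted.append(draft_next(prefix + drafted))

    # 2. Verify: keep drafted tokens while the target model agrees.
    #    (Real systems check all positions in one batched forward pass;
    #    this loop calls the target once per position for clarity.)
    accepted = []
    for token in drafted:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)

    # 3. Every draft was accepted: the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models: the target counts upward mod 10; the draft errs after a 3.
target = lambda seq: (seq[-1] + 1) % 10
flawed_draft = lambda seq: 0 if seq[-1] == 3 else (seq[-1] + 1) % 10

partial = speculative_step([0], flawed_draft, target, num_draft=4)
full = speculative_step([0], target, target, num_draft=4)
print(partial, full)
```

With greedy acceptance the output matches what the target model alone would have produced; the speed-up comes from accepting several tokens per verification step when the draft is usually right.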

Training Methodology

NVIDIA trained the model through four stages:

Pre-training: 20T tokens using crawled and synthetic data across code, math, science, and general knowledge (cutoff: September 2025)
Supervised fine-tuning: Synthetic data for code, math, science, tool calling, and instruction following
Reinforcement learning: Asynchronous GRPO (Group Relative Policy Optimization) across multiple environments
Multi-Domain On-Policy Distillation: Further refinement across domain-specific tasks
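
The core of GRPO, as published, is that each sampled response is scored relative to the other responses for the same prompt rather than against a learned value function. The sketch below shows only that normalisation step. The reward values and the function name are hypothetical, and nothing here reflects the asynchronous scheduling or the environments NVIDIA used.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(reward - mean) / (spread + eps) for reward in rewards]


# Hypothetical rewards for four sampled answers to one prompt (1.0 = passed).
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Responses that beat their group's mean get a positive advantage and the rest a negative one; a group where every sample earns the same reward yields all-zero advantages and so contributes no gradient signal.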

Post-training data has a cutoff date of May 2026. Training occurred between December 2025 and April 2026 using Megatron-LM and NeMo RL software.

Benchmark Performance

NVIDIA provided benchmark scores comparing BF16 and NVFP4 versions:

Agentic benchmarks:

Terminal Bench 2.1: 56.4 (BF16), 53.9 (NVFP4)
SWE-Bench Verified: 71.9 (BF16), 69.7 (NVFP4)
PinchBench: 90.0 (BF16), 89.8 (NVFP4)

Reasoning benchmarks:

GPQA (no tools): 87.0 (BF16), 87.9 (NVFP4)
IOI 2025: 570.0 (BF16), 564.7 (NVFP4)
OmniScience Accuracy: 24.1 (BF16), 24.6 (NVFP4)

Long context:

RULER 1M: 94.7 (BF16), 94.0 (NVFP4)
AA-LCR: 65.4 (BF16), 65.5 (NVFP4)

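The NVFP4 version trails BF16 on five of the eight benchmarks, by at most 2.5 points outside IOI 2025 (where it is 5.3 points lower on a 570-point score), and scores marginally higher on the other three: GPQA, OmniScience Accuracy, and AA-LCR.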
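
The comparison can be reproduced from the scores listed above. Because IOI 2025 is reported on a different scale from the other benchmarks, this sketch compares relative change rather than raw point differences; that choice, and the helper name `relative_change_pct`, are assumptions of the example rather than NVIDIA's methodology.

```python
# (BF16, NVFP4) pairs copied from the benchmark lists above.
SCORES = {
    "Terminal Bench 2.1": (56.4, 53.9),
    "SWE-Bench Verified": (71.9, 69.7),
    "PinchBench": (90.0, 89.8),
    "GPQA (no tools)": (87.0, 87.9),
    "IOI 2025": (570.0, 564.7),
    "OmniScience Accuracy": (24.1, 24.6),
    "RULER 1M": (94.7, 94.0),
    "AA-LCR": (65.4, 65.5),
}


def relative_change_pct(bf16, nvfp4):
    """Percentage change of the NVFP4 score relative to the BF16 score."""
    return (nvfp4 - bf16) / bf16 * 100


changes = {name: relative_change_pct(*pair) for name, pair in SCORES.items()}
largest_drop = min(changes, key=changes.get)
improved = sorted(name for name, pct in changes.items() if pct > 0)

for name, pct in changes.items():
    print(f"{name:22s} {pct:+.2f}%")
print("Largest relative drop:", largest_drop)
print("Higher under NVFP4:", improved)
```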

Availability and Licensing

The model is available on Hugging Face under the OpenMDW-1.1 license for both commercial and non-commercial use. Pricing for API access has not yet been disclosed.

What This Means

Nemotron-3-Ultra represents NVIDIA's entry into the 500B+ parameter frontier model space with a distinctive hybrid architecture that moves beyond pure transformer designs. The 1M token context window and configurable reasoning mode position it for long-document analysis and complex agentic workflows. The NVFP4 quantization approach appears effective at maintaining performance while reducing compute requirements, and the minimal benchmark degradation suggests this training methodology could influence future large-scale model development. The model's hardware requirements (minimum 4x GB200 or 8x H100) make it accessible primarily to well-resourced organizations, though the open weights under OpenMDW-1.1 enable self-hosting for those with appropriate infrastructure.

Source: huggingface.co


