model releaseNVIDIA

NVIDIA releases Nemotron-3-Ultra: 550B parameter model with 1M token context and configurable reasoning

TL;DR

NVIDIA released Nemotron-3-Ultra-550B, a frontier-scale model with 550B total parameters (55B active) and up to 1M token context window. The model uses a hybrid LatentMoE architecture combining Mamba-2, MoE, and attention layers with Multi-Token Prediction, trained with NVFP4 quantization-aware methods from December 2025 to April 2026.

June 5, 2026 · 4:51 AM2 min read

Nemotron-3-Ultra-550B-A55B — Quick Specs

Context window1000K tokens

Compare Nemotron-3-Ultra-550B-A55B with other models →

NVIDIA Releases Nemotron-3-Ultra: 550B Parameter Model with 1M Token Context

NVIDIA released Nemotron-3-Ultra-550B-A55B-BF16 on June 4, 2026, a frontier-scale language model with 550B total parameters and 55B active parameters. The model supports context windows up to 1M tokens and features configurable reasoning capabilities.

Architecture and Training

The model employs a hybrid LatentMoE (Latent Mixture-of-Experts) architecture that combines Mamba-2 layers, MoE layers, and attention layers. It incorporates Multi-Token Prediction (MTP) layers designed to accelerate text generation and improve output quality.

NVIDIA trained the model using an NVFP4 quantization-aware pre-training recipe from December 2025 to April 2026. Pre-training data has a cutoff date of September 2025, while post-training data extends to May 2026. The model was trained on approximately 20T tokens across code, math, science, and general knowledge datasets.

Hardware and Deployment

Minimum deployment requirements are substantial: 8x GB200/B200/GB300/B300 GPUs, 16x H100 GPUs, or 8x H200 GPUs. NVIDIA also released a quantized NVFP4 version (NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4) for reduced memory footprint.

Benchmark Performance

According to NVIDIA, the model achieves competitive scores across multiple benchmarks:

Agentic tasks: 56.4 on Terminal Bench 2.1, 71.9 on SWE-Bench Verified, 67.7 on SWE-Bench Multilingual
Reasoning: 89.0 on LiveCodeBench v6, 88.6 on IMOAnswerBench (no tools), 86.8 on MMLU-Pro
Long context: 94.7 on RULER (1M), 61.9 on Longbench v2 (≤1M)
Code: 570.0 on IOI 2025

The model trails DeepSeek-v4-Pro and several other frontier models on benchmarks like Terminal Bench 2.1 (67.2 for Kimi-K2.6 vs 56.4) and GDPVal (54.7 for GLM-5.1 vs 46.7).

Key Features

The model supports 11 languages: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, and Chinese. Reasoning mode can be toggled via the chat template using enable_thinking=True/False.

NVIDIA released the model under the OpenMDW License Agreement version 1.1, allowing both commercial and non-commercial use. The company states the model is optimized for "complex agentic workflows, long-context analysis, and high-accuracy reasoning over code, math, and science."

What This Means

Nemotron-3-Ultra represents NVIDIA's entry into the ultra-large model space with a distinctive hybrid architecture that prioritizes efficiency through sparse activation (55B of 550B parameters active per token) and quantization-aware training. The 1M token context window positions it competitively for long-document analysis, though benchmark results show it trailing specialized models like DeepSeek-v4-Pro on several agentic and reasoning tasks. The substantial hardware requirements (minimum 8x H200 or 16x H100) limit deployment to well-resourced organizations, though the NVFP4 quantized version may broaden accessibility. The configurable reasoning mode offers flexibility for applications where step-by-step thinking traces are either required or need to be minimized for latency.

Source: huggingface.co ↗

nvidia nemotron moe reasoning long-context open-weights mamba

model releaseJuly 20, 2026

Black Forest Labs releases FLUX.2: 32B open-weight image model with 4MP editing and 10-image multi-reference support

Black Forest Labs has released FLUX.2, a family of image generation models including a 32B parameter open-weight variant. The models support editing at up to 4 megapixel resolution and can reference up to 10 images simultaneously for character and style consistency.

model releaseJuly 17, 2026

Moonshot AI's Kimi k3 claims top performance among Chinese models with 1M token context

Moonshot AI has released Kimi k3, positioning it as China's leading AI model. The company claims the model features a 1 million token context window and improved reasoning capabilities, though independent benchmarks are not yet available.

benchmarkJuly 16, 2026

NVIDIA Nemotron 3 Embed 8B Tops RTEB Leaderboard with 78.5% Score, 1B Variant Cuts Error Rate 27%

NVIDIA's Nemotron-3-Embed-8B-BF16 ranks #1 on the RTEB leaderboard with a 78.5% score, while the 1B variant reduces error rate by 27% over its predecessor. The open-weight models feature 32k context windows and production-ready deployment options including a Blackwell-optimized NVFP4 variant.