model releaseNVIDIA

Nvidia Releases Nemotron 3 Ultra: 550B Parameter MoE Model with 1M Token Context Window

TL;DR

Nvidia has released Nemotron 3 Ultra, a 550B parameter mixture-of-experts model with 55B active parameters and a 1M token context window. The model uses a hybrid Transformer-Mamba architecture and is available for free through OpenRouter, targeting agentic workflows and multi-step reasoning tasks.

June 4, 2026 · 1:50 PM2 min read

Nemotron 3 Ultra — Quick Specs

Context window1000K tokens

Input$0.5/1M tokens

Output$2.5/1M tokens

Compare Nemotron 3 Ultra with other models →

Nvidia Releases Nemotron 3 Ultra: 550B Parameter MoE Model with 1M Token Context Window

Nvidia has released Nemotron 3 Ultra, a 550B parameter mixture-of-experts (MoE) model with 55B active parameters and support for up to 1M token context windows. The model is available for free through OpenRouter.

Architecture and Specifications

Nemotron 3 Ultra uses a hybrid Transformer-Mamba mixture-of-experts architecture, with 55B parameters active during inference out of 550B total parameters. According to Nvidia, the model is designed for "high-throughput inference" in production agentic pipelines.

The model supports text-only input and output with a context window of 1M tokens. Nvidia states the model is "particularly strong at multi-step reasoning and planning."

Target Use Cases

Nvidia positions Nemotron 3 Ultra for:

Agent orchestration
Coding agents
Deep research tasks
Complex enterprise workflows
Long-running agentic pipelines

The model is part of Nvidia's Nemotron family of open models focused on agentic AI applications.

Availability and Pricing

Nemotron 3 Ultra is available for free through OpenRouter, which routes requests to providers capable of handling the model's context window and parameters. Pricing per 1M tokens not disclosed for direct API access. Model weights are available, though specific hosting details were not provided.

The release date listed as June 4, 2026, appears to be an error in the source documentation.

What This Means

Nvidia's entry into the 1M+ context window space with a free MoE model intensifies competition in the agent-focused model segment. The hybrid Transformer-Mamba architecture represents a technical bet on alternatives to pure Transformer architectures for long-context scenarios. The 55B active parameter configuration suggests Nvidia is optimizing for inference efficiency over raw parameter count, though benchmark scores remain undisclosed. Free availability through OpenRouter lowers the barrier for developers building multi-step reasoning applications, potentially accelerating adoption in enterprise agent workflows.

Source: openrouter.ai ↗

nvidia nemotron mixture-of-experts moe long-context 1m-tokens agentic-ai transformer-mamba

model releaseJuly 15, 2026

Mira Murati's Thinking Machines releases Inkling, 975B-parameter open-weight model trained on 45T tokens

Thinking Machines Lab released Inkling, a 975-billion-parameter mixture-of-experts model that uses 41 billion active parameters per task. The open-weight model was trained on 45 trillion tokens across text, image, audio, and video, marking the first public release from Mira Murati's AI startup.

product updateJuly 17, 2026

NVIDIA NeMo Automodel integrates with Hugging Face Diffusers for distributed video and image model fine-tuning

NVIDIA and Hugging Face have integrated NeMo Automodel with the Diffusers library, enabling distributed fine-tuning of video and image diffusion models without checkpoint conversion. The integration supports models including FLUX.1-dev (12B), Wan 2.1 (1.3B/14B), and HunyuanVideo (13B) with full fine-tuning and LoRA options.

model releaseJuly 17, 2026

Moonshot AI's Kimi k3 claims top performance among Chinese models with 1M token context

Moonshot AI has released Kimi k3, positioning it as China's leading AI model. The company claims the model features a 1 million token context window and improved reasoning capabilities, though independent benchmarks are not yet available.

benchmarkJuly 16, 2026

NVIDIA Nemotron 3 Embed 8B Tops RTEB Leaderboard with 78.5% Score, 1B Variant Cuts Error Rate 27%

NVIDIA's Nemotron-3-Embed-8B-BF16 ranks #1 on the RTEB leaderboard with a 78.5% score, while the 1B variant reduces error rate by 27% over its predecessor. The open-weight models feature 32k context windows and production-ready deployment options including a Blackwell-optimized NVFP4 variant.

Nvidia Releases Nemotron 3 Ultra: 550B Parameter MoE Model with 1M Token Context Window

Nemotron 3 Ultra — Quick Specs

Nvidia Releases Nemotron 3 Ultra: 550B Parameter MoE Model with 1M Token Context Window

Architecture and Specifications

Target Use Cases

Availability and Pricing

What This Means

Related Articles

Mira Murati's Thinking Machines releases Inkling, 975B-parameter open-weight model trained on 45T tokens

NVIDIA NeMo Automodel integrates with Hugging Face Diffusers for distributed video and image model fine-tuning

Moonshot AI's Kimi k3 claims top performance among Chinese models with 1M token context

NVIDIA Nemotron 3 Embed 8B Tops RTEB Leaderboard with 78.5% Score, 1B Variant Cuts Error Rate 27%

Comments