model releaseNVIDIA

NVIDIA Nemotron 3 Ultra launches on AWS SageMaker with 550B parameters, 1M token context window

TL;DR

NVIDIA Nemotron 3 Ultra is now available on Amazon SageMaker JumpStart with 550 billion total parameters and 55 billion active parameters. The model features a hybrid Transformer-Mamba Mixture-of-Experts architecture and supports context windows up to 1 million tokens, targeting agentic AI workloads.

June 4, 2026 · 5:06 PM2 min read

NVIDIA Nemotron 3 Ultra — Quick Specs

Context window1000K tokens

Compare NVIDIA Nemotron 3 Ultra with other models →

NVIDIA Nemotron 3 Ultra launches on AWS SageMaker with 550B parameters, 1M token context window

NVIDIA Nemotron 3 Ultra is now available on Amazon SageMaker JumpStart with 550 billion total parameters and 55 billion active parameters. The model uses a hybrid Transformer-Mamba Mixture-of-Experts (MoE) architecture and supports context windows up to 1 million tokens.

Model specifications

Architecture: Hybrid Transformer-Mamba MoE
Parameters: 550B total / 55B active per forward pass
Context window: 1 million tokens
Precision: NVFP4 format
Modality: Text-to-text

The MoE architecture activates only 55 billion of the 550 billion total parameters per inference pass. According to NVIDIA, this design delivers 5x faster inference and up to 30% lower cost for agentic workloads compared to dense models of equivalent quality.

Deployment and pricing

Nemotron 3 Ultra deploys via one-click on SageMaker JumpStart using GPU instances including ml.p5en.48xlarge, ml.p5.48xlarge, or ml.g7e.48xlarge. AWS notes that these GPU instances cost several dollars per hour while running. Specific per-token pricing has not been disclosed.

The model is optimized for the NVFP4 format, a precision type designed to reduce hosting costs and improve inference speed.

Target use cases

NVIDIA positions Nemotron 3 Ultra specifically for multi-turn agentic workflows that span hundreds of interaction turns:

Agent orchestration systems that coordinate multiple sub-agents
Coding agents that generate, test, debug, and iterate on code across large repositories
Research synthesis tasks requiring extended context coherence
Multi-step enterprise automation with decision branching

The million-token context window allows agents to maintain state across extended tool-calling chains and planning loops.

Technical implementation

The hybrid Transformer-Mamba architecture combines traditional Transformer attention mechanisms with Mamba's structured state-space models. This architectural choice aims to maintain throughput at extended context lengths while keeping compute costs lower than dense models.

Developers can deploy using SageMaker Studio's interface or the SageMaker Python SDK. The model accepts standard chat completion payloads with configurable max_tokens, temperature, and top_p parameters.

Availability

Nemotron 3 Ultra is available immediately on Amazon SageMaker JumpStart. The model is described as "open" though specific licensing terms were not detailed in the announcement.

What this means

Nemotron 3 Ultra represents NVIDIA's direct entry into models purpose-built for agentic AI workflows. The 10:1 ratio between total and active parameters through MoE, combined with the 1M token context window, directly addresses the sustained compute demands of multi-turn agent interactions. The NVFP4 format optimization suggests NVIDIA is leveraging hardware-specific acceleration unavailable to other model providers. However, without independent benchmarks or disclosed per-token pricing, comparisons to existing agent-optimized models like Anthropic's Claude or GPT-4 remain speculative. The AWS-exclusive launch indicates strategic cloud partnership prioritization over broader distribution.

Source: aws.amazon.com ↗

NVIDIA Nemotron 3 Ultra AWS SageMaker MoE agentic AI Transformer-Mamba 1M context

product updateJuly 17, 2026

NVIDIA NeMo Automodel integrates with Hugging Face Diffusers for distributed video and image model fine-tuning

NVIDIA and Hugging Face have integrated NeMo Automodel with the Diffusers library, enabling distributed fine-tuning of video and image diffusion models without checkpoint conversion. The integration supports models including FLUX.1-dev (12B), Wan 2.1 (1.3B/14B), and HunyuanVideo (13B) with full fine-tuning and LoRA options.

benchmarkJuly 16, 2026

NVIDIA Nemotron 3 Embed 8B Tops RTEB Leaderboard with 78.5% Score, 1B Variant Cuts Error Rate 27%

NVIDIA's Nemotron-3-Embed-8B-BF16 ranks #1 on the RTEB leaderboard with a 78.5% score, while the 1B variant reduces error rate by 27% over its predecessor. The open-weight models feature 32k context windows and production-ready deployment options including a Blackwell-optimized NVFP4 variant.

model releaseJuly 16, 2026

Nvidia Launches Cosmos 3 Edge World Model for Physical AI, Forms Japan Industrial Coalition

Nvidia released Cosmos 3 Edge, a world model designed for robots and vision AI agents to perceive and navigate physical environments in real time. The company announced partnerships with Japanese industrial giants including Fujitsu, Hitachi, and Kawasaki Heavy Industries as part of its physical AI expansion.

model releaseJuly 15, 2026

Mira Murati's Thinking Machines releases Inkling, 975B-parameter open-weight model trained on 45T tokens

Thinking Machines Lab released Inkling, a 975-billion-parameter mixture-of-experts model that uses 41 billion active parameters per task. The open-weight model was trained on 45 trillion tokens across text, image, audio, and video, marking the first public release from Mira Murati's AI startup.

NVIDIA Nemotron 3 Ultra launches on AWS SageMaker with 550B parameters, 1M token context window

NVIDIA Nemotron 3 Ultra — Quick Specs

NVIDIA Nemotron 3 Ultra launches on AWS SageMaker with 550B parameters, 1M token context window

Model specifications

Deployment and pricing

Target use cases

Technical implementation

Availability

What this means

Related Articles

NVIDIA NeMo Automodel integrates with Hugging Face Diffusers for distributed video and image model fine-tuning

NVIDIA Nemotron 3 Embed 8B Tops RTEB Leaderboard with 78.5% Score, 1B Variant Cuts Error Rate 27%

Nvidia Launches Cosmos 3 Edge World Model for Physical AI, Forms Japan Industrial Coalition

Mira Murati's Thinking Machines releases Inkling, 975B-parameter open-weight model trained on 45T tokens

Comments