NVIDIA Nemotron 3 Ultra launches on AWS SageMaker with 550B parameters, 1M token context window
NVIDIA Nemotron 3 Ultra is now available on Amazon SageMaker JumpStart with 550 billion total parameters and 55 billion active parameters. The model features a hybrid Transformer-Mamba Mixture-of-Experts architecture and supports context windows up to 1 million tokens, targeting agentic AI workloads.
NVIDIA Nemotron 3 Ultra — Quick Specs
NVIDIA Nemotron 3 Ultra launches on AWS SageMaker with 550B parameters, 1M token context window
NVIDIA Nemotron 3 Ultra is now available on Amazon SageMaker JumpStart with 550 billion total parameters and 55 billion active parameters. The model uses a hybrid Transformer-Mamba Mixture-of-Experts (MoE) architecture and supports context windows up to 1 million tokens.
Model specifications
- Architecture: Hybrid Transformer-Mamba MoE
- Parameters: 550B total / 55B active per forward pass
- Context window: 1 million tokens
- Precision: NVFP4 format
- Modality: Text-to-text
The MoE architecture activates only 55 billion of the 550 billion total parameters per inference pass. According to NVIDIA, this design delivers 5x faster inference and up to 30% lower cost for agentic workloads compared to dense models of equivalent quality.
Deployment and pricing
Nemotron 3 Ultra deploys via one-click on SageMaker JumpStart using GPU instances including ml.p5en.48xlarge, ml.p5.48xlarge, or ml.g7e.48xlarge. AWS notes that these GPU instances cost several dollars per hour while running. Specific per-token pricing has not been disclosed.
The model is optimized for the NVFP4 format, a precision type designed to reduce hosting costs and improve inference speed.
Target use cases
NVIDIA positions Nemotron 3 Ultra specifically for multi-turn agentic workflows that span hundreds of interaction turns:
- Agent orchestration systems that coordinate multiple sub-agents
- Coding agents that generate, test, debug, and iterate on code across large repositories
- Research synthesis tasks requiring extended context coherence
- Multi-step enterprise automation with decision branching
The million-token context window allows agents to maintain state across extended tool-calling chains and planning loops.
Technical implementation
The hybrid Transformer-Mamba architecture combines traditional Transformer attention mechanisms with Mamba's structured state-space models. This architectural choice aims to maintain throughput at extended context lengths while keeping compute costs lower than dense models.
Developers can deploy using SageMaker Studio's interface or the SageMaker Python SDK. The model accepts standard chat completion payloads with configurable max_tokens, temperature, and top_p parameters.
Availability
Nemotron 3 Ultra is available immediately on Amazon SageMaker JumpStart. The model is described as "open" though specific licensing terms were not detailed in the announcement.
What this means
Nemotron 3 Ultra represents NVIDIA's direct entry into models purpose-built for agentic AI workflows. The 10:1 ratio between total and active parameters through MoE, combined with the 1M token context window, directly addresses the sustained compute demands of multi-turn agent interactions. The NVFP4 format optimization suggests NVIDIA is leveraging hardware-specific acceleration unavailable to other model providers. However, without independent benchmarks or disclosed per-token pricing, comparisons to existing agent-optimized models like Anthropic's Claude or GPT-4 remain speculative. The AWS-exclusive launch indicates strategic cloud partnership prioritization over broader distribution.
Related Articles
Nvidia Releases Nemotron 3 Ultra: 550B Parameter MoE Model with 1M Token Context Window
Nvidia has released Nemotron 3 Ultra, a 550B parameter mixture-of-experts model with 55B active parameters and a 1M token context window. The model uses a hybrid Transformer-Mamba architecture and is available for free through OpenRouter, targeting agentic workflows and multi-step reasoning tasks.
Nvidia Releases Free 4B-Parameter Nemotron 3.5 Content Safety Model with 128K Context
Nvidia has released Nemotron 3.5 Content Safety, a 4-billion parameter multimodal guardrail model fine-tuned from Google Gemma-3-4B. The model is available for free, supports 128K token context windows, and moderates content across 12 languages.
NVIDIA Releases Nemotron 3.5 ASR: 600M-Parameter Streaming Speech Model for 40 Languages
NVIDIA released Nemotron 3.5 ASR, a 600M-parameter speech-to-text model supporting 40 language-locales from a single checkpoint. The model achieves 0.07 seconds to final transcript after speech ends and ranks 2nd in latency among streaming ASR models according to Artificial Analysis benchmarks.
NVIDIA Shows Task-Seeded Synthetic Data Boosts Nemotron-3 Nano by +11.1 on GPQA
NVIDIA demonstrated that task-seeded synthetic Q&A data improves model performance across multiple benchmarks in a 100B-token continuation experiment on Nemotron-3 Nano. The approach improved GPQA scores by +11.1 points, MMLU-Pro by +1.8, average code by +1.9, and commonsense understanding by +1.6.
Comments
Loading...