NVIDIA Nemotron 3 Super now available on Amazon Bedrock with 256K context window
NVIDIA Nemotron 3 Super, a hybrid Mixture of Experts model with 120B parameters and 12B active parameters, is now available as a fully managed model on Amazon Bedrock. The model supports up to 256K token context length and claims 5x higher throughput efficiency over the previous Nemotron Super and 2x higher accuracy on reasoning tasks.
NVIDIA Nemotron 3 Super launches on Amazon Bedrock
NVIDIA Nemotron 3 Super is now available as a fully managed, serverless model on Amazon Bedrock, joining the existing Nemotron Nano offerings. The model uses a hybrid Mixture of Experts (MoE) architecture optimized for agentic AI systems and multi-agent workflows without requiring infrastructure management.
Model specifications
Nemotron 3 Super is a 120B parameter model with 12B active parameters per token, using a latent MoE design that enables 4x more experts at the same inference cost. The model supports:
- Context length: 256K tokens
- Architecture: Hybrid Transformer-Mamba with latent MoE
- Active parameters: 12B per token (latent MoE enables 4x more experts at the same inference cost)
- Input/output: Text only
- Supported languages: English, French, German, Italian, Japanese, Spanish, Chinese
- Multi-token prediction: Enabled, for faster generation of long reasoning sequences
Performance claims
NVIDIA claims the model achieves:
- 5x higher throughput efficiency over the previous Nemotron Super
- 2x higher accuracy on reasoning and agentic tasks compared to the prior version
- Leading performance on AIME 2025, Terminal-Bench, SWE-Bench Verified, RULER, and multilingual benchmarks
- Token budget support for improved accuracy with minimal reasoning token generation
The model was trained using multi-environment reinforcement learning across 10+ environments via NVIDIA NeMo, according to the company.
Key capabilities
The model is positioned for use cases including:
- Software development: Code generation and summarization
- Finance: Loan processing, data extraction, fraud detection
- Cybersecurity: Issue triage, malware analysis, threat hunting
- Search: User intent understanding and agent activation
- Retail: Inventory optimization and personalized recommendations
- Multi-agent workflows: Orchestrating task-specific agents for complex business processes
Access and pricing
The model is available through Amazon Bedrock's Chat playground and programmatically via the model ID nvidia.nemotron-super-3-120b. It supports:
- Amazon Bedrock console interface
- InvokeModel and Converse APIs
- AWS CLI and SDKs
- OpenAI SDK compatibility
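The programmatic path above can be sketched with boto3 and the Converse API. The model ID is the one given in this article; the region, temperature, and max-token settings below are illustrative assumptions, not recommendations.

```python
# Minimal sketch of invoking Nemotron 3 Super through the Bedrock Converse API.
# The model ID is taken from this article; region and generation settings are
# illustrative placeholders.

MODEL_ID = "nvidia.nemotron-super-3-120b"

def build_converse_request(prompt: str, max_tokens: int = 512) -> dict:
    """Assemble keyword arguments for the bedrock-runtime converse() call."""
    return {
        "modelId": MODEL_ID,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": 0.2},
    }

def ask(prompt: str, region: str = "us-east-1") -> str:
    import boto3  # imported here so the request builder itself needs no SDK

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.converse(**build_converse_request(prompt))
    # Converse responses nest generated text under output -> message -> content.
    return response["output"]["message"]["content"][0]["text"]

# Example usage (requires AWS credentials with Bedrock model access enabled):
#   print(ask("Summarize the trade-offs of Mixture-of-Experts inference."))
```

The same endpoint is also reachable through InvokeModel and the OpenAI-compatible interface; Converse is shown here because it uses one request shape across Bedrock models.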
AWS has not yet disclosed per-token pricing for input and output.
Technical details
Nemotron 3 Super uses latent MoE, in which experts operate on shared latent representations before their outputs are projected back to token space. This approach enables better specialization around semantic structures and multi-hop reasoning patterns. Multi-token prediction (MTP) lets the model predict several future tokens in a single forward pass, reducing latency for chain-of-thought reasoning, planning, and code generation.
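As a rough intuition for the latent-MoE mechanism just described, the toy sketch below projects a token's hidden state into a shared latent space, sparsely activates a few small experts there, and projects the mixture back to token space. Every dimension, weight matrix, and "expert" here is invented for illustration and does not reflect Nemotron's actual parameterization.

```python
# Toy latent-MoE: experts act on a shared low-dimensional latent vector rather
# than the full hidden state, so only a few small experts run per token.
# All weights are deterministic sin/cos fillers purely for demonstration.
import math

HIDDEN, LATENT, NUM_EXPERTS, TOP_K = 8, 4, 16, 2

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# W_down projects hidden -> latent; W_up projects latent -> hidden.
W_down = [[math.sin(i * LATENT + j) for j in range(LATENT)] for i in range(HIDDEN)]
W_up = [[math.cos(i * HIDDEN + j) for j in range(HIDDEN)] for i in range(LATENT)]
gate_w = [[math.sin(e + 0.1 * j) for j in range(LATENT)] for e in range(NUM_EXPERTS)]
# Each "expert" is just an elementwise scale in latent space in this toy.
experts = [[1.0 + 0.01 * e * d for d in range(LATENT)] for e in range(NUM_EXPERTS)]

def latent_moe(hidden):
    # 1. Project the token's hidden state into the shared latent space.
    z = [dot(hidden, [W_down[i][j] for i in range(HIDDEN)]) for j in range(LATENT)]
    # 2. Gate: score every expert, but activate only the top-k (sparsity).
    scores = [dot(gate_w[e], z) for e in range(NUM_EXPERTS)]
    top = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]
    weights = [math.exp(scores[e]) for e in top]
    total = sum(weights)
    # 3. Run only the selected experts in latent space and mix their outputs.
    mixed = [0.0] * LATENT
    for w, e in zip(weights, top):
        for d in range(LATENT):
            mixed[d] += (w / total) * experts[e][d] * z[d]
    # 4. Project the mixed latent representation back up to token space.
    return [dot(mixed, [W_up[i][j] for i in range(LATENT)]) for j in range(HIDDEN)]
```

Because gating and expert computation happen in the smaller latent space, many more experts can be kept around at a given per-token compute budget, which is the trade-off the "4x more experts at the same inference cost" claim refers to.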
The model is released with open weights, datasets, and training recipes, enabling developers to customize and deploy locally for enhanced privacy and security.
What this means
Bedrock now offers a high-efficiency reasoning model positioned against proprietary alternatives for agentic workflows. The 12B-active-parameter design suggests competitive inference costs, while the 256K context enables longer reasoning chains. The open-weights approach differentiates it from closed models, though actual latency, throughput, and cost versus competing models on Bedrock (Claude, Llama) remain unverified. Organizations building multi-agent systems should benchmark it against existing options, particularly for reasoning-heavy tasks, where the 2x accuracy improvement claim applies.
Related Articles
Nvidia releases Nemotron 3 Super: 120B MoE model with 1M token context
Nvidia has released Nemotron 3 Super, a 120-billion parameter hybrid Mamba-Transformer Mixture-of-Experts model that activates only 12 billion parameters during inference. The open-weight model features a 1-million token context window, multi-token prediction capabilities, and pricing at $0.10 per million input tokens and $0.50 per million output tokens.
NVIDIA releases Nemotron-3-Nano-4B, a 4B parameter model for edge AI with 262K context window
NVIDIA released Nemotron-3-Nano-4B-GGUF on March 16, 2026, a 4-billion parameter small language model (SLM) designed for edge deployment on devices like Jetson Thor and GeForce RTX. The model features a hybrid Mamba-2 and Transformer architecture with a 262K token context window and supports both reasoning and non-reasoning modes via system prompts.
NVIDIA releases Nemotron 3 Content Safety 4B for multimodal, multilingual moderation
NVIDIA released Nemotron 3 Content Safety 4B, an open-source multimodal safety model designed to moderate content across text, images, and multiple languages. Built on Gemma-3 4B-IT with a 128K context window, the model achieved 84% average accuracy on multimodal safety benchmarks and supports over 140 languages through culturally-aware training data.
OpenAI consolidating ChatGPT, Codex, and Atlas into single macOS superapp
OpenAI is consolidating its fragmented macOS app ecosystem by merging ChatGPT, Codex coding platform, and Atlas browser into a single "superapp" led by Chief of Applications Fidji Simo. The unified app will feature agentic AI capabilities for autonomous task execution and team collaboration, with rollout expected over coming months starting with Codex enhancements.