NVIDIA Nemotron 3 Nano Omni: 30B-parameter multimodal model launches on AWS SageMaker with 131K token context
NVIDIA has launched Nemotron 3 Nano Omni on Amazon SageMaker JumpStart, a multimodal model with 30 billion total parameters (3 billion active) that processes video, audio, images, and text in a single inference pass. The model features a 131K token context window and uses a hybrid Mamba2-Transformer Mixture-of-Experts (MoE) architecture that pairs a language backbone with dedicated vision and speech encoders.
Technical specifications
The model uses a hybrid Mamba2-Transformer Mixture-of-Experts (MoE) architecture combining three components:
- Nemotron 3 Nano LLM: Language backbone
- CRADIO v4-H: Vision encoder for image and video understanding
- Parakeet: Speech encoder for audio transcription
Key specifications:
- Context window: 131,072 tokens
- Total parameters: 30 billion
- Active parameters: 3 billion (MoE)
- Precision: FP8 on SageMaker
- Video support: Up to 2 minutes, up to 256 frames (MP4)
- Audio support: Up to 1 hour, sampling rate of 8 kHz or higher (WAV, MP3)
- Image formats: JPEG, PNG (RGB)
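The input limits above lend themselves to a client-side pre-flight check before uploading media. The sketch below is a hypothetical validator (the names and structure are assumptions; the endpoint performs its own validation):

```python
# Hypothetical pre-flight validator against the published input limits.
# The endpoint enforces its own checks; this only mirrors the spec list above.
LIMITS = {
    "video": {"max_seconds": 120, "max_frames": 256, "formats": {"mp4"}},
    "audio": {"max_seconds": 3600, "min_sample_rate_hz": 8000,
              "formats": {"wav", "mp3"}},
    "image": {"formats": {"jpeg", "png"}},
}

def check_input(kind: str, fmt: str, seconds: float = 0, frames: int = 0,
                sample_rate_hz: int = 0) -> bool:
    """Return True if a media file appears to fit the published limits."""
    rules = LIMITS[kind]
    if fmt.lower() not in rules["formats"]:
        return False
    if "max_seconds" in rules and seconds > rules["max_seconds"]:
        return False
    if "max_frames" in rules and frames > rules["max_frames"]:
        return False
    if "min_sample_rate_hz" in rules and sample_rate_hz < rules["min_sample_rate_hz"]:
        return False
    return True
```

Rejecting oversized media locally avoids paying for an inference call that would fail anyway.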
The model supports chain-of-thought reasoning, tool calling, JSON output, and word-level timestamps for transcription tasks. It is licensed under the NVIDIA Open Model Agreement for commercial use.
Architecture approach
According to AWS and NVIDIA, the unified architecture addresses a common pain point in enterprise AI systems: most agentic workflows currently stitch together separate models for vision, speech, and language. This fragmented approach increases latency through repeated inference passes, complicates orchestration, and amplifies costs.
Nemotron 3 Nano Omni processes all modalities in a single reasoning loop, eliminating repeated model calls and keeping a unified multimodal context throughout.
Deployment and inference
The model is available through Amazon SageMaker JumpStart with one-click deployment. AWS recommends deploying on ml.p4d.24xlarge or ml.p5.48xlarge instances.
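The same deployment can be scripted with the SageMaker Python SDK. The sketch below assumes a JumpStart model ID (the ID shown is a hypothetical placeholder; the real one appears in the JumpStart console):

```python
# Sketch of a scripted JumpStart deployment with the SageMaker Python SDK.
# The default model_id is a hypothetical placeholder, not a confirmed ID.
RECOMMENDED_INSTANCES = ("ml.p4d.24xlarge", "ml.p5.48xlarge")

def deploy_nemotron(model_id: str = "nvidia-nemotron-3-nano-omni",  # hypothetical
                    instance_type: str = "ml.p4d.24xlarge"):
    if instance_type not in RECOMMENDED_INSTANCES:
        raise ValueError(f"AWS recommends one of {RECOMMENDED_INSTANCES}")
    # Imported inside the function so the sketch reads without the SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(model_id=model_id)
    # deploy() provisions a real endpoint and incurs AWS charges.
    return model.deploy(instance_type=instance_type, accept_eula=True)
```

Gating on the recommended instance types up front avoids provisioning hardware that cannot hold the model.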
Recommended inference parameters vary by mode:
- Thinking mode (complex reasoning): temperature 0.6, top_p 0.95, max_tokens 20,480
- Instruct mode (general tasks, ASR): temperature 0.2, max_tokens 1,024
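These presets can be wired into a request builder. Only the sampling parameters come from the article; the OpenAI-style "messages" body below is an assumption, not a documented contract for this endpoint:

```python
# Sampling presets taken from the recommendations above; the "messages"
# request schema is an assumed OpenAI-style body, not a documented contract.
MODES = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "max_tokens": 20480},
    "instruct": {"temperature": 0.2, "max_tokens": 1024},
}

def build_payload(prompt: str, mode: str = "instruct") -> dict:
    """Merge the user prompt with the recommended sampling parameters."""
    return {"messages": [{"role": "user", "content": prompt}], **MODES[mode]}

# A live call would then serialize this dict as the request Body, roughly:
#   boto3.client("sagemaker-runtime").invoke_endpoint(
#       EndpointName="<your-endpoint>", ContentType="application/json",
#       Body=json.dumps(build_payload("Transcribe this call.")))
```

Keeping the presets in one table makes it easy to switch between the low-temperature instruct mode for ASR and the longer-budget thinking mode for complex reasoning.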
Enterprise applications
NVIDIA and AWS highlight several use cases:
Computer use agents: Reading screens, understanding UI state over time, and validating outcomes for incident management dashboards, browser automation, and email workflow agents.
Document intelligence: Interpreting contracts, financial documents, and scientific literature with mixed visual and text content.
Audio and video understanding: Meeting recording analysis, media asset management, drive-thru order verification, and customer service video review.
What this means
Nemotron 3 Nano Omni represents NVIDIA's entry into the unified multimodal model space, directly competing with offerings like GPT-4V and Gemini. The 131K context window is competitive but not leading—Claude 3.5 Sonnet offers 200K tokens, and Gemini 1.5 Pro supports up to 2 million tokens. The MoE architecture with 3B active parameters aims to reduce inference costs while maintaining capability, though pricing per million tokens was not disclosed. The key differentiation is the single-pass multimodal processing specifically optimized for agentic workflows, which could reduce orchestration complexity for enterprises building AI agents that need to process multiple input types simultaneously.
Related Articles
Nvidia releases Nemotron 3 Nano Omni: 30B-parameter multimodal model with 256K context, free on OpenRouter
Nvidia has released Nemotron 3 Nano Omni, a 30-billion-parameter multimodal model available free on OpenRouter. The model features a 256,000-token context window, accepts text, image, video, and audio inputs, and claims 2× higher throughput for video reasoning compared to separate vision and speech pipelines.
NVIDIA Releases Nemotron 3 Nano Omni: 30B-A3B Multimodal Model With 100+ Page Document Support
NVIDIA released Nemotron 3 Nano Omni, a 30B-A3B Mixture-of-Experts model that processes text, images, video, and audio. The model uses a hybrid Mamba-Transformer architecture with 128 experts and achieves 65.8 on OCRBenchV2-En and 72.2 on Video-MME, while delivering up to 9x higher throughput on multimodal tasks compared to alternatives.
Xiaomi releases MiMo-V2.5: 310B parameter omnimodal model with 1M token context window
Xiaomi released MiMo-V2.5, a 310B total parameter sparse mixture-of-experts model that activates 15B parameters per token. The omnimodal model supports text, image, video, and audio understanding with a 1M token context window and was trained on 48T tokens using FP8 mixed precision.
Xiaomi Releases MiMo-V2.5-Pro: 1.02T Parameter MoE Model with 1M Context Window
Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active parameters. The model supports up to 1 million tokens context length and claims 99.6% on GSM8K and 86.2% on MATH benchmarks.