model releaseNVIDIA

NVIDIA Nemotron 3 Nano Omni: 30B-parameter multimodal model launches on AWS SageMaker with 131K token context

TL;DR

NVIDIA has launched Nemotron 3 Nano Omni on Amazon SageMaker JumpStart, a multimodal model with 30 billion total parameters (3 billion active) that processes video, audio, images, and text in a single inference pass. The model features a 131K token context window and uses a Mamba2 Transformer Hybrid MoE architecture combining three specialized encoders.

2 min read
0

NVIDIA Nemotron 3 Nano Omni: 30B-parameter multimodal model launches on AWS SageMaker with 131K token context

NVIDIA has launched Nemotron 3 Nano Omni on Amazon SageMaker JumpStart, a multimodal model with 30 billion total parameters and 3 billion active parameters that processes video, audio, images, and text in a single inference pass.

Technical specifications

The model uses a Mamba2 Transformer Hybrid Mixture of Experts (MoE) architecture combining three components:

  • Nemotron 3 Nano LLM: Language backbone
  • CRADIO v4-H: Vision encoder for image and video understanding
  • Parakeet: Speech encoder for audio transcription

Key specifications:

  • Context window: 131,072 tokens
  • Total parameters: 30 billion
  • Active parameters: 3 billion (MoE)
  • Precision: FP8 on SageMaker
  • Video support: Up to 2 minutes, up to 256 frames (MP4)
  • Audio support: Up to 1 hour, 8kHz+ sampling rate (WAV, MP3)
  • Image formats: JPEG, PNG (RGB)

The model supports chain-of-thought reasoning, tool calling, JSON output, and word-level timestamps for transcription tasks. It is licensed under the NVIDIA Open Model Agreement for commercial use.

Architecture approach

According to AWS and NVIDIA, the unified architecture addresses a common pain point in enterprise AI systems: most agentic workflows currently stitch together separate models for vision, speech, and language. This fragmented approach increases latency through repeated inference passes, complicates orchestration, and amplifies costs.

Nemotron 3 Nano Omni processes all modalities in a single reasoning loop, eliminating the need for multiple model calls and maintaining converged multimodal context across reasoning loops.

Deployment and inference

The model is available through Amazon SageMaker JumpStart with one-click deployment. AWS recommends deploying on ml.p4d.24xlarge or ml.p5.48xlarge instances.

Recommended inference parameters vary by mode:

  • Thinking mode (complex reasoning): temperature 0.6, top_p 0.95, max_tokens 20,480
  • Instruct mode (general tasks, ASR): temperature 0.2, max_tokens 1,024

Enterprise applications

NVIDIA and AWS highlight several use cases:

Computer use agents: Reading screens, understanding UI state over time, and validating outcomes for incident management dashboards, browser automation, and email workflow agents.

Document intelligence: Interpreting contracts, financial documents, and scientific literature with mixed visual and text content.

Audio and video understanding: Meeting recording analysis, media asset management, drive-thru order verification, and customer service video review.

What this means

Nemotron 3 Nano Omni represents NVIDIA's entry into the unified multimodal model space, directly competing with offerings like GPT-4V and Gemini. The 131K context window is competitive but not leading—Claude 3.5 Sonnet offers 200K tokens, and Gemini 1.5 Pro supports up to 2 million tokens. The MoE architecture with 3B active parameters aims to reduce inference costs while maintaining capability, though pricing per million tokens was not disclosed. The key differentiation is the single-pass multimodal processing specifically optimized for agentic workflows, which could reduce orchestration complexity for enterprises building AI agents that need to process multiple input types simultaneously.

Related Articles

model release

Nvidia releases Nemotron 3 Ultra: 550B-parameter MoE model with 1M context window for agentic workflows

Nvidia has released Nemotron 3 Ultra, a 550-billion parameter mixture-of-experts model with 55 billion active parameters and support for up to 1 million token context windows. The model uses a hybrid Transformer-Mamba architecture and is designed specifically for long-running agentic workflows including agent orchestration, coding agents, and complex enterprise tasks.

model release

Moonshot AI releases Kimi K2.7 Code with 1T parameters, 256K context window, 30% lower thinking token usage

Moonshot AI has released Kimi K2.7 Code, a 1 trillion parameter Mixture-of-Experts model designed for long-horizon coding tasks. The model features a 256K context window and reduces thinking token usage by approximately 30% compared to its predecessor K2.6.

model release

Apple releases AFM 3 lineup: 20B-parameter on-device model and cloud AI running on Google's Nvidia infrastructure

Apple announced five third-generation foundation models at WWDC26, headlined by AFM 3 Core Advanced—a 20-billion-parameter sparse model that runs on-device by activating only 1-4 billion parameters at a time. For the first time, Apple extended Private Cloud Compute to third-party infrastructure, with AFM 3 Cloud Pro running on Nvidia GPUs in Google Cloud.

model release

Nex AGI Releases Nex-N2-Pro: 17B Active Parameter MoE Model with 262K Context Window

Nex AGI has released Nex-N2-Pro, a mixture-of-experts model with 17 billion active parameters from a total of 397 billion parameters. Built on the Qwen3.5 architecture, the model offers a 262,144 token context window and is available for free through OpenRouter.

Comments

Loading...