NVIDIA Releases Nemotron 3 Nano Omni: 31B Multimodal Model With 256K Context and Reasoning Mode
NVIDIA released Nemotron 3 Nano Omni, a 31B-parameter multimodal model built on a 30B-A3B Mixture-of-Experts LLM (3B parameters active per token) that accepts video, audio, image, and text inputs. The model features a 256K-token context window, a reasoning mode with chain-of-thought output, and tool calling capabilities.
NVIDIA released Nemotron 3 Nano Omni on April 28, 2026, a 31B-parameter multimodal model (3B parameters active per token) that processes video, audio, images, and text within a 256K-token context window.
Model Architecture and Capabilities
The model uses a hybrid Mamba2-Transformer Mixture-of-Experts (MoE) architecture, combining a Nemotron 3 Nano 30B LLM with the CRADIO v4-H vision encoder and the Parakeet speech encoder. According to NVIDIA, it supports video files up to 2 minutes (mp4; 1080p at 1 FPS for up to 128 frames, or 720p at 2 FPS for up to 256 frames), audio files up to 1 hour (wav/mp3, 8 kHz or higher sampling rate), standard image formats (jpeg/png), and English text.
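Those input limits can be expressed as a small pre-flight check before submitting media to the model. This is an illustrative sketch, not NVIDIA tooling: the function names and the frame-budget rule (duration × FPS must fit the frame cap) are assumptions based on the figures above.

```python
# Pre-flight checks for the stated input limits: 2-minute video at
# 1080p/1 FPS (128 frames) or 720p/2 FPS (256 frames), and audio up
# to 1 hour sampled at 8 kHz or higher. Illustrative only.

VIDEO_PROFILES = {
    1080: {"fps": 1, "max_frames": 128},
    720: {"fps": 2, "max_frames": 256},
}

def video_within_limits(height: int, duration_s: float) -> bool:
    """True if a video fits one of the supported resolution profiles."""
    profile = VIDEO_PROFILES.get(height)
    if profile is None or duration_s > 120:   # 2-minute hard cap
        return False
    return duration_s * profile["fps"] <= profile["max_frames"]

def audio_within_limits(duration_s: float, sample_rate_hz: int) -> bool:
    """True if audio is at most 1 hour long and sampled at >= 8 kHz."""
    return duration_s <= 3600 and sample_rate_hz >= 8000
```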
Key features include:
- Reasoning mode with chain-of-thought output (reasoning budget: 16,384 tokens, grace period: 1,024 tokens)
- JSON output format support
- Tool calling functionality
- Word-level timestamps for transcription
- GUI and OCR capabilities
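The reasoning-budget figures suggest a simple token-accounting scheme. Below is a minimal sketch of one plausible interpretation; the budget/grace semantics here are an assumption, not NVIDIA's documented behavior.

```python
# Illustrative accounting for the stated reasoning budget of 16,384
# tokens plus a 1,024-token grace period (semantics assumed).

REASONING_BUDGET = 16_384
GRACE_PERIOD = 1_024

def budget_state(cot_tokens: int) -> str:
    """Classify how many chain-of-thought tokens have been emitted."""
    if cot_tokens < REASONING_BUDGET:
        return "within_budget"          # keep reasoning freely
    if cot_tokens < REASONING_BUDGET + GRACE_PERIOD:
        return "grace"                  # wrap up the current thought
    return "exhausted"                  # force the final answer
```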
Availability and Hardware Requirements
The model is available in three precision formats on Hugging Face:
- BF16 (~62GB)
- FP8
- NVFP4 (NVIDIA's 4-bit format)
NVIDIA specifies compatibility with the Ampere (A100 80GB), Hopper (H100/H200), Ada Lovelace (L40S), and Blackwell (B200, RTX 5090, RTX Pro 6000 SE) architectures. The model runs on the vLLM (0.20.0), TensorRT-LLM, llama.cpp, Ollama, and SGLang runtimes.
Training and Commercial Use
According to NVIDIA, the model was improved using Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen2.5-VL-72B-Instruct, and gpt-oss-120b. The model is available for commercial use under the NVIDIA Open Model Agreement.
NVIDIA targets enterprise use cases including customer service (video verification, OCR), media and entertainment (video analysis, dense captions), document intelligence (contracts, financial documents), and GUI automation for agentic applications.
Deployment Configuration
For single-GPU deployment (B200), NVIDIA recommends:
- Thinking mode: temperature 0.6, top_p 0.95, max_tokens 20,480
- Instruct mode: temperature 0.2, top_k 1, max_tokens 1,024
- Maximum model length: 131,072 tokens (expandable to full 256K context)
- FP8 KV cache for memory efficiency
The vLLM configuration supports up to 384 concurrent sequences via the --max-num-seqs flag.
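Wired into client code, those recommendations might look like the sketch below. The model identifier in the serve command is a placeholder and the overall shape is an assumption; only the sampling values and vLLM flags come from the guidance above.

```python
# Recommended decoding settings from NVIDIA's single-GPU (B200) guidance.
# A matching vLLM launch line might look like (model id is a placeholder):
#
#   vllm serve <nemotron-3-nano-omni-model-id> \
#       --max-model-len 131072 --max-num-seqs 384 --kv-cache-dtype fp8

RECOMMENDED = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "max_tokens": 20_480},
    "instruct": {"temperature": 0.2, "top_k": 1, "max_tokens": 1_024},
}

def sampling_params(mode: str) -> dict:
    """Return a copy of the recommended decoding settings for a mode."""
    try:
        return dict(RECOMMENDED[mode])
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
```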
What This Means
Nemotron 3 Nano Omni represents NVIDIA's push into unified multimodal processing for enterprise applications, directly competing with GPT-4V and Gemini 1.5 in video understanding. The 256K context window and 2-minute video support enable processing of full meeting recordings and training videos without chunking. The MoE architecture (3B active per token from 30B total) provides efficiency gains over dense models, though real-world performance benchmarks against competitors remain to be published. The reasoning mode positions it against o1-preview/o3-mini for tasks requiring step-by-step problem solving, while tool calling and JSON output support agentic workflows. Notably, NVIDIA provides GGUF quantizations via Unsloth for local deployment, expanding accessibility beyond datacenter GPUs to RTX 5090 and similar consumer hardware.
Related Articles
NVIDIA Releases Nemotron 3 Nano Omni: 31B-Parameter Multimodal Model with 256K Context and Reasoning Mode
NVIDIA has released Nemotron 3 Nano Omni 30B-A3B, a multimodal large language model with 31 billion parameters using a Mamba2-Transformer hybrid Mixture of Experts architecture. The model supports video, audio, image, and text inputs with a 256K token context window and includes a dedicated reasoning mode with chain-of-thought capabilities.
Nvidia releases Nemotron 3 Nano Omni: 30B-parameter multimodal model with 256K context, free on OpenRouter
Nvidia has released Nemotron 3 Nano Omni, a 30-billion-parameter multimodal model available free on OpenRouter. The model features a 256,000-token context window, accepts text, image, video, and audio inputs, and claims 2× higher throughput for video reasoning compared to separate vision and speech pipelines.
NVIDIA Nemotron 3 Nano Omni: 30B-parameter multimodal model launches on AWS SageMaker with 131K token context
NVIDIA has launched Nemotron 3 Nano Omni on Amazon SageMaker JumpStart, a multimodal model with 30 billion total parameters (3 billion active) that processes video, audio, images, and text in a single inference pass. The model features a 131K token context window and uses a Mamba2 Transformer Hybrid MoE architecture combining three specialized encoders.
NVIDIA Releases Nemotron 3 Nano Omni: 30B-A3B Multimodal Model With 100+ Page Document Support
NVIDIA released Nemotron 3 Nano Omni, a 30B-A3B Mixture-of-Experts model that processes text, images, video, and audio. The model uses a hybrid Mamba-Transformer architecture with 128 experts and achieves 65.8 on OCRBenchV2-En and 72.2 on Video-MME, while delivering up to 9x higher throughput on multimodal tasks compared to alternatives.