NVIDIA Releases Nemotron 3 Nano Omni: 31B Multimodal Model With 256K Context and Reasoning Mode
NVIDIA released Nemotron 3 Nano Omni, a 31B-parameter multimodal model (about 3B parameters active per token) supporting video, audio, image, and text inputs. The model features a 256K-token context window, a reasoning mode with chain-of-thought output, and tool calling.
On April 28, 2026, NVIDIA released Nemotron 3 Nano Omni, a 31B-parameter multimodal model (about 3B parameters active per token, hence the 30B-A3B designation) that processes video, audio, images, and text with a 256K-token context window.
Model Architecture and Capabilities
The model uses a Mamba2-Transformer hybrid Mixture of Experts (MoE) architecture, combining a Nemotron 3 Nano 30B LLM with the CRADIO v4-H vision encoder and the Parakeet speech encoder. According to NVIDIA, it supports video files up to 2 minutes long (mp4; 1080p at 1 FPS with up to 128 frames, or 720p at 2 FPS with up to 256 frames), audio files up to 1 hour long (wav/mp3, sampled at 8 kHz or higher), standard image formats (jpeg/png), and English text.
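The video limits above are consistent with each other: a 2-minute clip sampled at the stated rates stays just under each frame cap. The following is a minimal sketch of that arithmetic, assuming uniform sampling at a fixed rate with a hard cap on the number of frames. The function name and the capping behavior are illustrative assumptions, not NVIDIA's preprocessing code.

```python
def sampled_frames(duration_s: float, fps: float, frame_cap: int) -> int:
    """Frames drawn from a clip at a fixed sampling rate, clipped to a cap.

    Assumption: uniform sampling with a hard cap; the real pipeline may differ.
    """
    return min(int(duration_s * fps), frame_cap)


# Limits quoted in the article: resolution -> (sampling FPS, frame cap).
LIMITS = {"1080p": (1, 128), "720p": (2, 256)}

for resolution, (fps, cap) in LIMITS.items():
    frames = sampled_frames(120, fps, cap)  # 120 s is the 2-minute maximum
    print(f"{resolution}: {frames} of {cap} frames")
# 1080p: 120 of 128 frames
# 720p: 240 of 256 frames
```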
Key features include:
- Reasoning mode with chain-of-thought output (reasoning budget: 16,384 tokens, grace period: 1,024 tokens)
- JSON output format support
- Tool calling functionality
- Word-level timestamps for transcription
- GUI and OCR capabilities
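To see how the reasoning budget relates to an overall output limit, the sketch below subtracts the budget and grace period from the 20,480-token max_tokens value recommended for thinking mode later in this article. It assumes the grace period is added on top of the budget and that reasoning tokens count toward max_tokens; the article does not confirm either point, so treat the result as a rough illustration.

```python
REASONING_BUDGET = 16_384  # tokens, from the feature list above
GRACE_PERIOD = 1_024       # tokens, from the feature list above


def answer_headroom(max_tokens: int) -> int:
    """Tokens left for the final answer if reasoning uses budget plus grace.

    Assumption: budget and grace period are additive and both count
    toward max_tokens.
    """
    return max_tokens - (REASONING_BUDGET + GRACE_PERIOD)


print(answer_headroom(20_480))  # 3072
```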
Availability and Hardware Requirements
The model is available in three precision formats on Hugging Face:
- BF16 (~62GB)
- FP8
- NVFP4 (NVIDIA's 4-bit format)
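The ~62GB figure for BF16 follows from 31B parameters at 2 bytes each. The sketch below applies the same arithmetic to the other two formats. These are back-of-the-envelope estimates of weight storage only: quantized checkpoints typically carry scaling metadata and may keep some layers in higher precision, so published file sizes can differ.

```python
TOTAL_PARAMS = 31e9  # total parameter count quoted in the article

# Nominal bits per weight for each precision format.
BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def weight_gb(params: float, bits: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: ~{weight_gb(TOTAL_PARAMS, bits):.1f} GB")
# BF16: ~62.0 GB
# FP8: ~31.0 GB
# NVFP4: ~15.5 GB
```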
NVIDIA specifies compatibility with Ampere (A100 80GB), Hopper (H100/H200), Blackwell (B200, RTX 5090, RTX Pro 6000 SE), and Lovelace (L40S) architectures. The model runs on vLLM 0.20.0, TensorRT LLM, llama.cpp, Ollama, and SGLang runtimes.
Training and Commercial Use
According to NVIDIA, the model was improved using Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen2.5-VL-72B-Instruct, and gpt-oss-120b. The model is available for commercial use under the NVIDIA Open Model Agreement.
NVIDIA targets enterprise use cases including customer service (video verification, OCR), media and entertainment (video analysis, dense captions), document intelligence (contracts, financial documents), and GUI automation for agentic applications.
Deployment Configuration
For single-GPU deployment (B200), NVIDIA recommends:
- Thinking mode: temperature 0.6, top_p 0.95, max_tokens 20,480
- Instruct mode: temperature 0.2, top_k 1, max_tokens 1,024
- Maximum model length: 131,072 tokens (expandable to full 256K context)
- FP8 KV cache for memory efficiency
The vLLM configuration supports up to 384 concurrent sequences, set via the --max-num-seqs parameter.
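As a rough illustration, the settings above can be collected into a vLLM launch command and two sampling presets. This is a sketch rather than NVIDIA's published command: the model identifier is a placeholder because the article does not give the Hugging Face repository name, and build_serve_command is a hypothetical helper. The flags themselves (--max-model-len, --kv-cache-dtype, --max-num-seqs) are standard vLLM engine arguments.

```python
import shlex

# Placeholder: the article does not state the Hugging Face repository name.
MODEL_ID = "MODEL_ID_PLACEHOLDER"


def build_serve_command(model_id, max_model_len=131_072, max_num_seqs=384):
    """Assemble a `vllm serve` invocation from the settings listed above."""
    return [
        "vllm", "serve", model_id,
        "--max-model-len", str(max_model_len),  # raise toward 256K if memory allows
        "--kv-cache-dtype", "fp8",              # FP8 KV cache for memory efficiency
        "--max-num-seqs", str(max_num_seqs),    # concurrent sequence limit
    ]


# Sampling presets from the recommendations above, keyed by mode.
SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "max_tokens": 20_480},
    "instruct": {"temperature": 0.2, "top_k": 1, "max_tokens": 1_024},
}

print(shlex.join(build_serve_command(MODEL_ID)))
```

The presets would be passed per request (for example, as sampling fields in the request body), while the command-line flags apply to the server as a whole.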
What This Means
Nemotron 3 Nano Omni represents NVIDIA's push into unified multimodal processing for enterprise applications, placing it in competition with GPT-4V and Gemini 1.5 in video understanding. The 1-hour audio limit means the audio of a typical meeting can be submitted as a single file, but video input is capped at 2 minutes, so longer recordings and training videos still have to be split into clips. The MoE architecture (about 3B parameters active per token out of roughly 30B total) should provide efficiency gains over dense models of similar size, though real-world performance benchmarks against competitors remain to be published.
The reasoning mode positions it against o1-preview and o3-mini for tasks requiring step-by-step problem solving, while tool calling and JSON output support agentic workflows. GGUF quantizations are also available via Unsloth for local deployment, extending access beyond datacenter GPUs to the RTX 5090 and similar consumer hardware.