NVIDIA Releases Nemotron 3 Nano Omni: 31B-Parameter Multimodal Model with 256K Context and Reasoning Mode

TL;DR

NVIDIA has released Nemotron 3 Nano Omni 30B-A3B, a multimodal large language model with 31 billion total parameters built on a Mamba2-Transformer hybrid Mixture of Experts architecture. The model supports video, audio, image, and text inputs with a 256K token context window and includes a dedicated reasoning mode with chain-of-thought capabilities.

NVIDIA has released Nemotron 3 Nano Omni 30B-A3B-Reasoning, a multimodal large language model with 31 billion total parameters, of which roughly 3 billion are active per token (the "A3B" in the name) thanks to its MoE architecture. The model launched on April 28, 2026, on Hugging Face, Build.Nvidia.com, and NGC.
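
The "30B-A3B" naming reflects sparse expert routing: a router scores every expert for each token, but only a few top-scoring experts actually run. Below is a minimal, generic top-k routing sketch in Python, not NVIDIA's implementation; the expert count, the value of k, the toy scalar "experts", and the helper names `top_k_route` and `moe_layer` are all hypothetical.

```python
import math

# Generic top-k mixture-of-experts routing; NOT Nemotron's actual router.
# The expert count, k, and scalar "experts" below are hypothetical.


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(x, experts, router_logits, k):
    """Weighted sum of only the k routed experts; the other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Example: 8 hypothetical experts, each just scaling its input.
experts = [lambda x, s=s: s * x for s in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]
print(top_k_route(logits, k=2))  # experts 4 and 1 are selected
print(moe_layer(1.0, experts, logits, k=2))
```

Only the two selected experts are evaluated in this toy example. The same mechanism is what lets a model with about 31B total parameters touch only about 3B of them per token, roughly a tenth of its weights.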

Architecture and Capabilities

Nemotron 3 Nano Omni uses a Mamba2-Transformer hybrid Mixture of Experts (MoE) architecture, combining a 30B-A3B Nemotron 3 Nano LLM base with a CRADIO v4-H vision encoder and a Parakeet speech encoder. The model supports:

  • Video: MP4 files up to 2 minutes, sampling at 1-2 FPS depending on resolution (up to 256 frames for 720p)
  • Audio: WAV and MP3 files up to 1 hour, at sampling rates of 8 kHz or higher
  • Images: JPEG and PNG formats
  • Text: English only
  • Context window: 256,000 tokens
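
The stated input limits translate naturally into a pre-flight check before sending a request. The sketch below assumes uniform frame sampling capped at 256 frames and treats the listed limits as hard cutoffs; the constants, the `validate` and `sampled_frame_count` helpers, and the exact sampling policy are illustrative assumptions rather than documented model behavior.

```python
import math

# Limits as stated for Nemotron 3 Nano Omni; enforcing them this way is an assumption.
MAX_VIDEO_SECONDS = 120          # MP4, up to 2 minutes
MAX_AUDIO_SECONDS = 3600         # WAV/MP3, up to 1 hour
MIN_AUDIO_SAMPLE_RATE_HZ = 8000  # 8 kHz or higher
MAX_FRAMES = 256                 # stated frame cap for 720p video

FORMATS = {
    "video": {"mp4"},
    "audio": {"wav", "mp3"},
    "image": {"jpeg", "jpg", "png"},
}


def sampled_frame_count(duration_s: float, fps: float, max_frames: int = MAX_FRAMES) -> int:
    """Frames kept when sampling uniformly at `fps`, capped at `max_frames` (assumed policy)."""
    if duration_s <= 0 or fps <= 0:
        raise ValueError("duration and fps must be positive")
    return min(math.ceil(duration_s * fps), max_frames)


def validate(kind: str, ext: str, duration_s: float = 0.0, sample_rate_hz: int = 0) -> list[str]:
    """Return a list of problems with an input; an empty list means it passes."""
    allowed = FORMATS.get(kind)
    if allowed is None:
        return [f"unknown input kind: {kind}"]
    problems = []
    if ext.lower().lstrip(".") not in allowed:
        problems.append(f"unsupported {kind} format: {ext}")
    if kind == "video" and duration_s > MAX_VIDEO_SECONDS:
        problems.append(f"video is {duration_s:g}s; limit is {MAX_VIDEO_SECONDS}s")
    if kind == "audio":
        if duration_s > MAX_AUDIO_SECONDS:
            problems.append(f"audio is {duration_s:g}s; limit is {MAX_AUDIO_SECONDS}s")
        if sample_rate_hz < MIN_AUDIO_SAMPLE_RATE_HZ:
            problems.append(f"sample rate {sample_rate_hz} Hz is below {MIN_AUDIO_SAMPLE_RATE_HZ} Hz")
    return problems
```

At the maximum 2-minute length, sampling at 2 FPS yields 240 frames, which stays under the 256-frame cap.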

The model includes two operating modes: a reasoning mode with chain-of-thought capabilities (with a 16,384-token reasoning budget) and a standard instruct mode. It supports JSON output formatting, tool calling, and word-level timestamps for transcription tasks.
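
Because the reasoning trace consumes tokens before the final answer, it helps to budget the context window explicitly. The sketch below assumes that the input, the 16,384-token reasoning budget, and the answer all share the 256,000-token window, and that instruct mode reserves no reasoning tokens; `max_input_tokens` and the 4,096-token answer reservation used in the usage note are hypothetical.

```python
CONTEXT_WINDOW = 256_000   # tokens, as stated
REASONING_BUDGET = 16_384  # tokens reserved for chain-of-thought in reasoning mode


def max_input_tokens(answer_tokens: int, reasoning: bool = True,
                     context_window: int = CONTEXT_WINDOW,
                     reasoning_budget: int = REASONING_BUDGET) -> int:
    """Tokens left for the prompt (text plus encoded media) after reserving output space.

    Assumes input, reasoning trace, and answer share one context window, and that
    instruct mode reserves no reasoning tokens; both are assumptions.
    """
    if answer_tokens < 0:
        raise ValueError("answer_tokens must be non-negative")
    reserved = answer_tokens + (reasoning_budget if reasoning else 0)
    remaining = context_window - reserved
    if remaining <= 0:
        raise ValueError("output reservation leaves no room for input")
    return remaining
```

Under these assumptions, reserving 4,096 answer tokens leaves 235,520 tokens for text and encoded media in reasoning mode, versus 251,904 in instruct mode.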

Training and Development

According to NVIDIA, the model was improved using Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen2.5-VL-72B-Instruct, and gpt-oss-120b, though specific training methodology details were not disclosed.

Deployment and Availability

The model is available in three precision formats:

  • BF16 (approximately 62 GB)
  • FP8
  • NVFP4 (NVIDIA's 4-bit format)

NVIDIA claims the model is optimized for its Ampere, Ada Lovelace, Hopper, and Blackwell GPU architectures, with specific support for A100, H100, H200, B200, L40S, RTX Pro 6000 SE, and RTX 5090 GPUs. The model requires vLLM 0.20.0 for inference.
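
The BF16 size follows from simple arithmetic: 31 billion parameters at 2 bytes each is about 62 GB. The sketch below extends that estimate to the other formats and checks it against a given GPU's memory. It assumes every parameter is stored in the named format (real checkpoints often keep some layers in higher precision, so the FP8 and NVFP4 numbers are lower bounds), counts NVFP4 as about 4.5 bits per weight (4-bit values plus one FP8 scale per 16-value block), and uses a hypothetical 80% headroom factor to leave room for KV cache, Mamba state, and activations. The `weight_gb` and `fits` helpers are illustrative, not part of any NVIDIA tool.

```python
TOTAL_PARAMS = 31e9  # total parameters, as stated

# Approximate storage bits per parameter. NVFP4: 4-bit values plus one FP8 scale
# per 16-value block (~4.5 bits); per-tensor scales are ignored as negligible.
BITS_PER_PARAM = {"BF16": 16.0, "FP8": 8.0, "NVFP4": 4.0 + 8.0 / 16}


def weight_gb(fmt: str, params: float = TOTAL_PARAMS) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes), assuming all weights use `fmt`."""
    return params * BITS_PER_PARAM[fmt] / 8 / 1e9


def fits(fmt: str, gpu_gb: float, headroom: float = 0.8) -> bool:
    """True if the weights fit within `headroom` of the GPU's memory (hypothetical rule of thumb)."""
    return weight_gb(fmt) <= gpu_gb * headroom


for fmt in BITS_PER_PARAM:
    print(f"{fmt}: ~{weight_gb(fmt):.1f} GB of weights")
```

Under these assumptions the BF16 weights fit on an 80 GB card such as an H100 but not on a 48 GB L40S, while NVFP4 (about 17.4 GB) fits comfortably on a 32 GB RTX 5090.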

The model is released under the NVIDIA Open Model Agreement and is available for commercial use.

Target Use Cases

NVIDIA positions Nemotron 3 Nano Omni for enterprise applications including:

  • Customer service (video verification, drive-thru order processing)
  • Media and entertainment video analysis
  • Document intelligence for contracts and financial documents
  • GUI automation for AI agents
  • Meeting transcription and summarization

What This Means

Nemotron 3 Nano Omni represents NVIDIA's entry into the multimodal reasoning model category, directly competing with models like GPT-4o and Claude 3.5 Sonnet. The 256K context window and dedicated reasoning mode position it for enterprise document processing tasks. However, the English-only limitation and the lack of disclosed benchmark scores make performance comparisons difficult. The MoE design, which activates only about 3B of its 31B parameters per token, suggests efficiency goals, though actual inference costs and speeds on different hardware remain to be independently verified. The model's vLLM integration and support for NVIDIA's NVFP4 quantization indicate a focus on deployment flexibility across NVIDIA's hardware ecosystem.
