NVIDIA Releases Nemotron 3 Nano Omni: 30B-A3B Multimodal Model With 100+ Page Document Support

TL;DR

NVIDIA released Nemotron 3 Nano Omni, a 30B-A3B Mixture-of-Experts model that processes text, images, video, and audio. The model uses a hybrid Mamba-Transformer architecture with 128 experts and scores 65.8 on OCRBenchV2-En and 72.2 on Video-MME. NVIDIA also claims up to 9x higher throughput on multimodal tasks, without naming the models it compared against.

On April 28, 2026, NVIDIA released Nemotron 3 Nano Omni, a multimodal model that processes text, images, video, and audio in a unified architecture. The model uses a 30B-A3B Mixture-of-Experts backbone with 128 experts and top-6 routing.

Architecture and Scale

Nemotron 3 Nano Omni combines three components: C-RADIOv4-H for vision, Parakeet-TDT-0.6B-v2 for audio, and the Nemotron 3 Nano 30B-A3B language model as the backbone. The architecture interleaves 23 Mamba selective state-space layers, 23 MoE layers with 128 experts each, and 6 grouped-query attention layers.
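To make the routing figures concrete, here is a minimal sketch of top-6 routing over 128 experts for a single token. The expert count and the value of k come from the article. Everything else is an assumption for illustration: the function name `route_token`, the random router logits, and the choice to normalize weights with a softmax over the selected experts only. NVIDIA's actual router may score and normalize differently.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 6          # from the article


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts for one token.

    Returns a list of (expert_index, weight) pairs. The weights are a softmax
    over the selected logits only (an assumed convention), so they sum to 1.
    """
    if top_k > len(router_logits):
        raise ValueError("top_k cannot exceed the number of experts")
    # Indices of the top_k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:top_k]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router output for one token (random values, fixed seed).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

for expert, weight in selected:
    print(f"expert {expert:3d}  weight {weight:.3f}")
print(f"experts used per token: {len(selected)} of {NUM_EXPERTS}")
```

Only the six selected experts run for a given token in each MoE layer, which is how a model with roughly 30B total parameters can activate only about 3B per token.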

For vision processing, the model supports dynamic resolution from 512x512 (1,024 patches) to 1840x1840 (13,312 patches) at native aspect ratio. Video processing uses a Conv3D tubelet embedding that fuses each pair of consecutive frames, halving the number of vision tokens.
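The patch and token arithmetic can be sketched as follows. The 16-pixel patch size is an assumption inferred from the 512x512 to 1,024 patches figure, not something the article states. The sketch also treats one patch as one vision token, which ignores any further token merging the model may apply. The function names are hypothetical.

```python
import math

PATCH_SIZE = 16  # assumption: 512 / 16 = 32, and 32 * 32 = 1,024 patches


def patches_per_image(width, height, patch_size=PATCH_SIZE):
    """Number of square patches needed to cover a width x height image."""
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)


def video_vision_tokens(num_frames, patches_per_frame, tubelet_frames=2):
    """Vision tokens for a clip when every `tubelet_frames` consecutive
    frames are fused into one set of tokens (2 = frame pairs)."""
    return math.ceil(num_frames / tubelet_frames) * patches_per_frame


low = patches_per_image(512, 512)
high = patches_per_image(1840, 1840)
print(low)   # 1024
print(high)  # 13225

# A hypothetical 32-frame clip at the lowest resolution.
print(video_vision_tokens(32, low, tubelet_frames=1))  # 32768 without fusion
print(video_vision_tokens(32, low))                    # 16384 with frame pairs
```

Under this assumption, 1840x1840 yields a 115x115 grid of 13,225 patches, slightly below the quoted 13,312 (which equals 13 x 1,024). One possible reading is that the upper figure is a patch budget rather than an exact grid for that resolution, but the article does not say.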

Benchmark Performance

According to NVIDIA, Nemotron 3 Nano Omni achieves:

  • 65.8 on OCRBenchV2-En (versus 61.2 for its predecessor Nemotron Nano V2 VL)
  • 57.5 on MMLongBench-Doc
  • 72.2 on Video-MME
  • 89.4 on VoiceBench
  • 5.95 word error rate on HF Open ASR
  • 57.8 on ScreenSpot-Pro for GUI understanding
  • 47.4 on OSWorld for computer use tasks

The model leads Qwen3-Omni 30B-A3B on most benchmarks, including document understanding (57.5 vs 49.5 on MMLongBench-Doc) and voice interaction (89.4 vs 88.8 on VoiceBench).
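The size of each lead varies considerably, which a quick calculation makes visible. The scores below are the ones quoted above. The dictionary layout and the `margins` helper are only an illustration.

```python
# (Nemotron 3 Nano Omni score, reference score, reference model), as reported.
reported = {
    "OCRBenchV2-En": (65.8, 61.2, "Nemotron Nano V2 VL"),
    "MMLongBench-Doc": (57.5, 49.5, "Qwen3-Omni 30B-A3B"),
    "VoiceBench": (89.4, 88.8, "Qwen3-Omni 30B-A3B"),
}


def margins(scores):
    """Point difference between the new model and each reference score."""
    return {name: round(new - ref, 1) for name, (new, ref, _) in scores.items()}


for name, delta in margins(reported).items():
    reference = reported[name][2]
    print(f"{name}: +{delta} points over {reference}")
```

The MMLongBench-Doc gap is 8.0 points, while the VoiceBench gap is 0.6 points, so the two comparisons against Qwen3-Omni are not equally decisive.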

Throughput Claims

NVIDIA claims Nemotron 3 Nano Omni delivers up to 9x higher throughput and 2.9x faster single-stream reasoning on multimodal use cases, without naming the baseline for those two figures. The company also cites 7.4x higher system efficiency for multi-document workloads and 9.2x for video use cases compared to "other open omni models with the same interactivity."

Training Approach

The training recipe uses staged multimodal alignment and context extension, followed by preference optimization and multimodal reinforcement learning. The model can process 100+ page documents and includes an Efficient Video Sampling (EVS) feature that drops redundant video tokens after the vision encoder to reduce latency.
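The article says only that EVS drops redundant video tokens after the vision encoder. The sketch below illustrates the general idea with one plausible criterion, and it should not be read as NVIDIA's algorithm. The cosine-similarity test, the comparison against the same position in the previous frame, the 0.95 threshold, and all function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_redundant_tokens(frames, threshold=0.95):
    """Drop tokens that barely changed since the previous frame.

    `frames` is a list of frames, each a list of token embeddings with one
    embedding per spatial position. All tokens of the first frame are kept.
    A later token is kept only if its similarity to the token at the same
    position in the previous frame is below `threshold`.
    Returns (frame_index, position, embedding) tuples for the kept tokens.
    """
    kept = []
    for t, frame in enumerate(frames):
        for p, token in enumerate(frame):
            if t == 0 or cosine(token, frames[t - 1][p]) < threshold:
                kept.append((t, p, token))
    return kept


# Toy clip: 3 frames, 2 positions, 2-dimensional embeddings (made-up values).
clip = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0: always kept
    [[1.0, 0.0], [1.0, 0.0]],  # frame 1: position 1 changed
    [[1.0, 0.0], [1.0, 0.0]],  # frame 2: nothing changed
]
kept = prune_redundant_tokens(clip)
print([(t, p) for t, p, _ in kept])  # [(0, 0), (0, 1), (1, 1)]
print(f"kept {len(kept)} of {sum(len(f) for f in clip)} tokens")
```

Because pruning of this kind happens after the encoder, the vision encoder still processes every frame. The savings come from the shorter token sequence passed to the language model.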

Availability

Nemotron 3 Nano Omni is available on Hugging Face in BF16, FP8, and NVFP4 formats. Pricing information was not disclosed.
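For a rough sense of what the three formats mean for memory, here is a back-of-the-envelope estimate of weight storage. It assumes "30B-A3B" means about 30 billion total and 3 billion active parameters, and it uses nominal bit widths only. It ignores quantization scaling factors, any layers kept in higher precision, the vision and audio encoders, and runtime state such as the KV cache, so real checkpoints will differ.

```python
# Nominal bits per weight for each published format (scales not counted).
BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "NVFP4": 4}

TOTAL_PARAMS = 30e9   # assumption: nominal total from the "30B" label
ACTIVE_PARAMS = 3e9   # assumption: nominal active count from the "A3B" label


def weight_gigabytes(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for fmt, bits in BITS_PER_PARAM.items():
    total = weight_gigabytes(TOTAL_PARAMS, bits)
    active = weight_gigabytes(ACTIVE_PARAMS, bits)
    print(f"{fmt:>5}: ~{total:.0f} GB total weights, ~{active:.1f} GB active per token")
```

Under these assumptions the estimates are about 60 GB for BF16, 30 GB for FP8, and 15 GB for NVFP4. All experts must still be resident in memory even though only a fraction is used per token, so the total figure is the one that determines whether the model fits on a given GPU.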

What This Means

Nemotron 3 Nano Omni is NVIDIA's entry into the competitive omni-modal space, positioned against models like Qwen3-Omni with a focus on enterprise document processing and computer use. The hybrid Mamba-Transformer-MoE design departs from pure attention-based approaches, so real-world deployment efficiency will depend on framework support for these specialized layers. The reported OCR and document scores (65.8 on OCRBenchV2-En, 57.5 on MMLongBench-Doc) and computer use results (47.4 on OSWorld) suggest practical applicability for enterprise workflows, but the throughput claims and production performance still need independent verification.
