model release · NVIDIA

NVIDIA Releases Nemotron 3 Nano Omni: 30B-A3B Multimodal Model With 100+ Page Document Support

TL;DR

NVIDIA released Nemotron 3 Nano Omni, a 30B-A3B Mixture-of-Experts model that processes text, images, video, and audio. The model uses a hybrid Mamba-Transformer architecture with 128 experts and achieves 65.8 on OCRBenchV2-En and 72.2 on Video-MME, while NVIDIA claims up to 9x higher throughput on multimodal tasks than other open omni models.

2 min read

On April 28, 2026, NVIDIA released Nemotron 3 Nano Omni, a multimodal model that processes text, images, video, and audio in a unified architecture. The backbone is a 30B-A3B Mixture-of-Experts design (30 billion total parameters, 3 billion active per token) with 128 experts and top-6 routing.

Architecture and Scale

Nemotron 3 Nano Omni pairs two modality encoders, C-RADIOv4-H for vision and Parakeet-TDT-0.6B-v2 for audio, with the Nemotron 3 Nano 30B-A3B language backbone. That backbone interleaves 23 Mamba selective state-space layers, 23 MoE layers of 128 experts each, and 6 grouped-query attention layers.
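To make the top-6-of-128 routing concrete, here is a minimal PyTorch sketch of token-level expert selection. The hidden size and gating details are illustrative assumptions, not NVIDIA's implementation:

```python
import torch
import torch.nn.functional as F

def moe_route(hidden, router_weight, top_k=6):
    """Minimal top-k MoE routing sketch (illustrative, not NVIDIA's code).

    hidden:        (tokens, d_model) token activations
    router_weight: (d_model, num_experts) learned router projection
    Returns expert indices and normalized gate weights per token. Only
    top_k of the experts run for each token, which is how a 30B-parameter
    model can activate only ~3B parameters per token.
    """
    logits = hidden @ router_weight                   # (tokens, num_experts)
    gates, experts = torch.topk(logits, top_k, dim=-1)
    gates = F.softmax(gates, dim=-1)                  # renormalize over the chosen 6
    return experts, gates

# Example: route 10 tokens of width 2048 (hidden size is an assumption)
hidden = torch.randn(10, 2048)
router = torch.randn(2048, 128)                       # 128 experts
experts, gates = moe_route(hidden, router)
print(experts.shape, gates.shape)                     # (10, 6) and (10, 6)
```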

For vision processing, the model supports dynamic resolution from 512x512 (1,024 patches) to 1840x1840 (13,312 patches) at native aspect ratio. Video processing uses Conv3D tubelet embedding that fuses consecutive frame pairs, halving the number of vision tokens.
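The 512x512 figure is consistent with 16x16 spatial patches (512/16 = 32, and 32 x 32 = 1,024 patches). Below is a minimal sketch of a tubelet embedding that fuses consecutive frame pairs; the 16-pixel patch size and embedding width are assumptions for illustration, not confirmed by NVIDIA:

```python
import torch
import torch.nn as nn

# Tubelet embedding sketch: a Conv3d whose temporal kernel/stride of 2
# fuses each pair of consecutive frames into one set of vision tokens.
# Patch size 16 and embed dim 1024 are illustrative assumptions.
patch = 16
tubelet = nn.Conv3d(
    in_channels=3, out_channels=1024,
    kernel_size=(2, patch, patch),   # 2 frames x 16x16 pixels per tubelet
    stride=(2, patch, patch),
)

# 8 frames of 512x512 video -> 4 temporal slots x 32x32 spatial patches
video = torch.randn(1, 3, 8, 512, 512)         # (B, C, T, H, W)
tokens = tubelet(video)                        # (1, 1024, 4, 32, 32)
tokens = tokens.flatten(2).transpose(1, 2)     # (1, 4096, 1024)
print(tokens.shape)  # 4,096 tokens: half of the 8 * 1,024 per-frame patches
```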

Benchmark Performance

According to NVIDIA, Nemotron 3 Nano Omni achieves:

  • 65.8 on OCRBenchV2-En (versus 61.2 for its predecessor Nemotron Nano V2 VL)
  • 57.5 on MMLongBench-Doc
  • 72.2 on Video-MME
  • 89.4 on VoiceBench
  • 5.95 word error rate on the Hugging Face Open ASR leaderboard
  • 57.8 on ScreenSpot-Pro for GUI understanding
  • 47.4 on OSWorld for computer use tasks

The model leads Qwen3-Omni 30B-A3B on most benchmarks, including document understanding (57.5 vs 49.5 on MMLongBench-Doc) and voice interaction (89.4 vs 88.8 on VoiceBench).

Throughput Claims

NVIDIA claims Nemotron 3 Nano Omni delivers up to 9x higher throughput and 2.9x faster single-stream reasoning speed on multimodal use cases. Specifically, the company cites 7.4x higher system efficiency for multi-document workloads and 9.2x for video use cases compared to "other open omni models with the same interactivity," without naming the baseline models.

Training Approach

The training recipe uses staged multimodal alignment and context extension, followed by preference optimization and multimodal reinforcement learning. The model can process 100+ page documents and includes an Efficient Video Sampling (EVS) feature that drops redundant video tokens after the vision encoder to reduce latency.
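NVIDIA has not published EVS's exact pruning rule in this announcement. The sketch below illustrates one common approach to post-encoder video token pruning, dropping tokens that are nearly identical to the same patch position in the previous frame; the cosine-similarity rule and threshold are assumptions:

```python
import torch
import torch.nn.functional as F

def prune_redundant_tokens(frame_tokens, threshold=0.95):
    """Illustrative EVS-style pruning, not NVIDIA's published algorithm.

    frame_tokens: (frames, patches, dim) vision-encoder outputs.
    Keeps every token of frame 0, then drops tokens whose cosine
    similarity to the same patch position in the previous frame
    exceeds `threshold` (i.e., static background regions).
    Returns a list of (frame_idx, patch_idx) for the kept tokens.
    """
    kept = [(0, p) for p in range(frame_tokens.shape[1])]
    for t in range(1, frame_tokens.shape[0]):
        sim = F.cosine_similarity(frame_tokens[t], frame_tokens[t - 1], dim=-1)
        kept += [(t, p) for p in (sim < threshold).nonzero().flatten().tolist()]
    return kept

# A mostly static 16-frame clip keeps far fewer than 16 * 64 tokens
tokens = torch.randn(1, 64, 256).repeat(16, 1, 1)
tokens += 0.01 * torch.randn_like(tokens)   # tiny frame-to-frame changes
print(len(prune_redundant_tokens(tokens)))  # close to 64, not 1,024
```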

Availability

Nemotron 3 Nano Omni is available on Hugging Face in BF16, FP8, and NVFP4 formats. Pricing information was not disclosed.
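As a rough sketch of how such a checkpoint is typically loaded, the snippet below uses the standard transformers flow. The repository name is hypothetical, since the announcement did not give the exact Hugging Face model ID; the FP8 and NVFP4 variants would presumably follow the same pattern with their respective repositories:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# HYPOTHETICAL repo id -- check NVIDIA's Hugging Face org for the exact name.
model_id = "nvidia/Nemotron-3-Nano-Omni-30B-A3B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # BF16 is one of the three published formats
    device_map="auto",
    trust_remote_code=True,       # custom Mamba/MoE layers typically need this
)
```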

What This Means

Nemotron 3 Nano Omni is NVIDIA's entry into the competitive omni-modal space, positioned against models like Qwen3-Omni with a focus on enterprise document processing and computer-use tasks. The hybrid Mamba-Transformer-MoE design is a notable departure from pure attention-based approaches, though real-world deployment efficiency will depend on inference-framework support for these specialized layers. The strong document understanding (65.8 on OCRBenchV2-En) and computer-use (47.4 on OSWorld) scores suggest practical applicability for enterprise workflows, though the throughput claims and production performance still await independent verification.

Related Articles

model release

NVIDIA Nemotron 3 Nano Omni: 30B-parameter multimodal model launches on AWS SageMaker with 131K token context

NVIDIA has launched Nemotron 3 Nano Omni on Amazon SageMaker JumpStart, a multimodal model with 30 billion total parameters (3 billion active) that processes video, audio, images, and text in a single inference pass. The model features a 131K token context window and uses a Mamba2-Transformer hybrid MoE architecture combining three specialized encoders.

model release

Nvidia releases Nemotron 3 Nano Omni: 30B-parameter multimodal model with 256K context, free on OpenRouter

Nvidia has released Nemotron 3 Nano Omni, a 30-billion-parameter multimodal model available free on OpenRouter. The model features a 256,000-token context window, accepts text, image, video, and audio inputs, and claims 2× higher throughput for video reasoning compared to separate vision and speech pipelines.

model release

Xiaomi releases MiMo-V2.5: 310B parameter omnimodal model with 1M token context window

Xiaomi released MiMo-V2.5, a 310B total parameter sparse mixture-of-experts model that activates 15B parameters per token. The omnimodal model supports text, image, video, and audio understanding with a 1M token context window and was trained on 48T tokens using FP8 mixed precision.

model release

Xiaomi Releases MiMo-V2.5-Pro: 1.02T Parameter MoE Model with 1M Context Window

Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active parameters. The model supports up to 1 million tokens context length and claims 99.6% on GSM8K and 86.2% on MATH benchmarks.
