model releaseNVIDIA

NVIDIA releases Nemotron-3-Nano-Omni-30B, a 31B-parameter multimodal model with 256K context and reasoning mode

TL;DR

NVIDIA released Nemotron-3-Nano-Omni-30B-A3B, a multimodal large language model with 31 billion parameters that processes video, audio, images, and text with up to 256K token context. The model uses a Mamba2-Transformer hybrid Mixture of Experts architecture and supports chain-of-thought reasoning mode.

2 min read
0

NVIDIA Releases Nemotron-3-Nano-Omni-30B with Multimodal Processing and Reasoning Mode

NVIDIA released Nemotron-3-Nano-Omni-30B-A3B, a 31 billion-parameter multimodal model that processes video, audio, images, and text with up to 256,000 token context length. The model is available commercially under the NVIDIA Open Model Agreement.

Architecture and Specifications

Nemotron-3-Nano-Omni uses a Mamba2-Transformer hybrid Mixture of Experts (MoE) architecture with 31B total parameters and 3B active parameters (A3B). The model combines three specialized encoders:

  • Nemotron 3 Nano LLM (30B A3B) for language processing
  • CRADIO v4-H vision encoder for image and video
  • Parakeet speech encoder for audio

The model accepts video files up to 2 minutes at 1 FPS (1080p) or 2 FPS (720p), audio files up to 1 hour, and images in JPEG/PNG format. It supports English only.

Key Capabilities

According to NVIDIA, the model provides:

  • Video and speech comprehension
  • GUI automation and OCR
  • Speech transcription with word-level timestamps
  • JSON output format support
  • Chain-of-thought reasoning mode with configurable reasoning budget (up to 16,384 tokens)
  • Tool calling capabilities

Training and Development

NVIDIA states the model was improved using Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen2.5-VL-72B-Instruct, and gpt-oss-120b. Specific training methodologies and benchmark scores were not disclosed.

Deployment Requirements

The model requires vLLM 0.20.0 and runs on NVIDIA Ampere, Hopper, Blackwell, and Lovelace GPUs. Available precision formats include BF16 (~62GB), FP8, and NVFP4. NVIDIA recommends 131,072 maximum model length for single-GPU deployment with tensor-parallel-size 1.

Recommended inference parameters vary by mode:

  • Thinking mode: temperature 0.6, top_p 0.95, max_tokens 20,480
  • Instruct mode: temperature 0.2, top_k 1, max_tokens 1,024

The model supports deployment on edge devices including Jetson Thor and consumer hardware like RTX 5090.

Availability

Nemotron-3-Nano-Omni-30B is available on Hugging Face, Build.Nvidia.com, and NGC as of April 28, 2026. Runtime engines include vLLM, TensorRT-LLM, NeMo Megatron, llama.cpp, Ollama, and SGLang.

What This Means

NVIDIA's release targets enterprise multimodal applications that require unified processing of video, audio, and documents—use cases that previously required multiple specialized models. The 256K context window and reasoning mode position it for complex document analysis and extended video processing. The commercial license and edge deployment support (including consumer RTX 5090) differentiate it from research-focused multimodal models, though pricing and comparative benchmarks against competitors like GPT-4V or Gemini were not provided.

Related Articles

model release

Amazon Bedrock adds Gemma 4 models with 256K context and built-in reasoning mode

Amazon Web Services today announced availability of Google DeepMind's Gemma 4 family on Amazon Bedrock. The open-weight models include three instruction-tuned variants spanning 2.3B to 30.7B parameters, with 256K context windows, multimodal input support, and built-in reasoning mode.

model release

Moonshot AI releases Kimi K2.7 Code with 1T parameters, 256K context window, 30% lower thinking token usage

Moonshot AI has released Kimi K2.7 Code, a 1 trillion parameter Mixture-of-Experts model designed for long-horizon coding tasks. The model features a 256K context window and reduces thinking token usage by approximately 30% compared to its predecessor K2.6.

model release

Apple releases AFM 3 lineup: 20B-parameter on-device model and cloud AI running on Google's Nvidia infrastructure

Apple announced five third-generation foundation models at WWDC26, headlined by AFM 3 Core Advanced—a 20-billion-parameter sparse model that runs on-device by activating only 1-4 billion parameters at a time. For the first time, Apple extended Private Cloud Compute to third-party infrastructure, with AFM 3 Cloud Pro running on Nvidia GPUs in Google Cloud.

model release

MiniMax Releases M3: 428B-Parameter Multimodal Model with 1M Context Window and 15× Decode Speedup

MiniMax has released M3, a multimodal model with approximately 428 billion parameters and 23 billion activated parameters. The model supports a 1 million token context window and uses MiniMax Sparse Attention to achieve 9× prefill and 15× decode speedups compared to its predecessor M2.

Comments

Loading...