
NVIDIA Releases Cosmos3-Nano: 16B-Parameter Omnimodal World Model for Physical AI with 256K Token Context

TL;DR

NVIDIA has released Cosmos3-Nano, a 16-billion parameter omnimodal world model capable of generating video, audio, images, and robot action commands from combinations of text, image, video, and action trajectory inputs. The model supports a 256K token context window and is designed for Physical AI applications including robotics, autonomous vehicles, and smart manufacturing environments.

June 2, 2026 · 1:51 AM · 2 min read

Cosmos3-Nano — Quick Specs

Context window: 256K tokens


NVIDIA has released Cosmos3-Nano, a 16-billion parameter omnimodal world model that generates video, audio, images, and robot action commands from multimodal inputs. The model is part of the Cosmos3 collection and supports a 256K token context window for reasoning tasks.

Technical Specifications

Cosmos3-Nano uses a Mixture-of-Transformers (MoT) architecture that combines an autoregressive transformer for discrete token generation with a diffusion transformer for continuous multimodal generation. The model accepts inputs in five modalities: text, images, video, audio, and action trajectories. Video inputs may be supplied with or without an audio track.

Input specifications:

Text: up to 256K tokens
Images: 256p, 480p, and 720p at aspect ratios 16:9, 4:3, 1:1, 3:4, 9:16
Video: same resolutions/aspect ratios, maximum 5 frames for input
Audio: 48 kHz stereo, maximum 0.5 seconds
Action trajectories: 16-400 frames
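The limits above lend themselves to a pre-flight check before a request is sent to the model. The following is a minimal sketch, not NVIDIA's API: `InputRequest` and `validate_inputs` are hypothetical names, and "256K" is assumed to mean 256,000 tokens (the exact figure may differ).

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits copied from the input specifications listed in the article.
MAX_TEXT_TOKENS = 256_000  # assumption: "256K" read as 256,000
ALLOWED_RESOLUTIONS = {"256p", "480p", "720p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "4:3", "1:1", "3:4", "9:16"}
MAX_INPUT_VIDEO_FRAMES = 5
MAX_INPUT_AUDIO_SECONDS = 0.5
ACTION_FRAME_RANGE = (16, 400)


@dataclass
class InputRequest:
    """Hypothetical description of one request; not an NVIDIA data structure."""
    text_tokens: int = 0
    resolution: Optional[str] = None
    aspect_ratio: Optional[str] = None
    video_frames: int = 0
    audio_seconds: float = 0.0
    action_frames: Optional[int] = None


def validate_inputs(req: InputRequest) -> List[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if req.text_tokens > MAX_TEXT_TOKENS:
        problems.append(f"text has {req.text_tokens} tokens, limit is {MAX_TEXT_TOKENS}")
    if req.resolution is not None and req.resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"unsupported resolution {req.resolution!r}")
    if req.aspect_ratio is not None and req.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio {req.aspect_ratio!r}")
    if req.video_frames > MAX_INPUT_VIDEO_FRAMES:
        problems.append(f"video has {req.video_frames} frames, limit is {MAX_INPUT_VIDEO_FRAMES}")
    if req.audio_seconds > MAX_INPUT_AUDIO_SECONDS:
        problems.append(f"audio is {req.audio_seconds}s, limit is {MAX_INPUT_AUDIO_SECONDS}s")
    if req.action_frames is not None:
        low, high = ACTION_FRAME_RANGE
        if not low <= req.action_frames <= high:
            problems.append(f"action trajectory has {req.action_frames} frames, expected {low}-{high}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every violation in a single pass.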

Output generation supports video from 5 to 400 frames (189 frames default), with audio encoded in AAC format at 48 kHz stereo. The model generates outputs at resolutions matching input specifications.
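As a rough illustration of the output frame limits, the sketch below resolves a requested frame count against the 5 to 400 range and falls back to the 189-frame default. The function names are hypothetical, and the frame rate is left to the caller because the article does not state one.

```python
from typing import Optional

# Output limits quoted in the article.
MIN_OUTPUT_FRAMES = 5
MAX_OUTPUT_FRAMES = 400
DEFAULT_OUTPUT_FRAMES = 189


def resolve_output_frames(requested: Optional[int] = None) -> int:
    """Return the frame count to generate, using the default when none is given."""
    if requested is None:
        return DEFAULT_OUTPUT_FRAMES
    if not MIN_OUTPUT_FRAMES <= requested <= MAX_OUTPUT_FRAMES:
        raise ValueError(
            f"requested {requested} frames, expected {MIN_OUTPUT_FRAMES}-{MAX_OUTPUT_FRAMES}"
        )
    return requested


def clip_seconds(frames: int, fps: float) -> float:
    """Clip duration in seconds; fps is caller-supplied (not stated in the article)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps
```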

Model Variants

NVIDIA released multiple Cosmos3 variants simultaneously:

Cosmos3-Nano: 16B parameters for general omnimodal tasks
Cosmos3-Super: 64B parameters for enhanced performance
Cosmos3-Nano-Policy-DROID: 16B parameters specialized for robot manipulation
Cosmos3-Super-Image2Video: 64B parameters for image-to-video generation
Cosmos3-Super-Text2Image: 64B parameters for text-to-image generation
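To give a sense of scale, the variants can be held in a small lookup table and their weight storage estimated at 2 bytes per parameter, which is the size of a BF16 value. This is a back-of-the-envelope sketch: the table and helper are hypothetical, the parameter counts are the nominal figures above, and the estimate covers weights only, not activations or caches.

```python
from typing import Dict, NamedTuple


class Variant(NamedTuple):
    params_billion: int  # nominal parameter count from the article
    purpose: str


VARIANTS: Dict[str, Variant] = {
    "Cosmos3-Nano": Variant(16, "general omnimodal tasks"),
    "Cosmos3-Super": Variant(64, "enhanced performance"),
    "Cosmos3-Nano-Policy-DROID": Variant(16, "robot manipulation"),
    "Cosmos3-Super-Image2Video": Variant(64, "image-to-video generation"),
    "Cosmos3-Super-Text2Image": Variant(64, "text-to-image generation"),
}

BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each value in 16 bits


def bf16_weight_gb(name: str) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    # billions of parameters * bytes per parameter = gigabytes
    return float(VARIANTS[name].params_billion * BYTES_PER_PARAM_BF16)
```

Under these assumptions the 16B variants need roughly 32 GB for weights and the 64B variants roughly 128 GB, before any runtime overhead.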

Hardware and Deployment

The models require NVIDIA GPU-accelerated systems running on Ampere, Hopper, or Blackwell architectures. Only BF16 precision is officially supported. Runtime integration is available through PyTorch, vLLM-Omni, and Hugging Face Diffusers.
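One way to check a GPU against this list is by its CUDA compute capability. The sketch below is an assumption-laden illustration: the mapping from capability numbers to architecture names follows NVIDIA's public compute capability tables as commonly cited (8.0, 8.6, 8.7 for Ampere, 8.9 for Ada Lovelace, 9.0 for Hopper, 10.x and 12.x for Blackwell), and it reads the article's list literally, so anything outside the three named architectures is reported as unsupported.

```python
from typing import Tuple

# Architectures named in the article as supported.
SUPPORTED_ARCHITECTURES = {"Ampere", "Hopper", "Blackwell"}


def architecture_name(capability: Tuple[int, int]) -> str:
    """Map a (major, minor) compute capability to an architecture name.

    The mapping is an assumption based on public compute capability tables.
    """
    major, minor = capability
    if major == 8:
        # 8.9 is Ada Lovelace; the other 8.x values are Ampere.
        return "Ada Lovelace" if minor == 9 else "Ampere"
    if major == 9:
        return "Hopper"
    if major in (10, 12):
        return "Blackwell"
    return "unrecognized"


def is_supported(capability: Tuple[int, int]) -> bool:
    """True when the capability maps to an architecture the article lists."""
    return architecture_name(capability) in SUPPORTED_ARCHITECTURES
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed directly to `is_supported`.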

Cosmos3-Nano supports robot action generation for 10 different embodiments, including Franka Panda arms, WidowX 250, and various industrial robots. Action outputs are embodiment-specific, ranging from 9D to 57D depending on the robot platform.
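Because action dimensionality depends on the embodiment, downstream code will typically want to verify a generated action sequence before sending it to a controller. The sketch below is hypothetical: the per-robot dimensions are not given in the article, so the caller supplies `expected_dim`, and the only figures taken from the source are the 9D and 57D bounds.

```python
from typing import Sequence

# Bounds on per-step action dimensionality quoted in the article.
MIN_ACTION_DIM = 9
MAX_ACTION_DIM = 57


def check_action_chunk(actions: Sequence[Sequence[float]], expected_dim: int) -> None:
    """Raise ValueError if any step's vector does not match the embodiment's dimension.

    expected_dim is supplied by the caller; the article does not list
    per-embodiment values.
    """
    if not MIN_ACTION_DIM <= expected_dim <= MAX_ACTION_DIM:
        raise ValueError(
            f"expected_dim {expected_dim} is outside {MIN_ACTION_DIM}-{MAX_ACTION_DIM}"
        )
    for step, vector in enumerate(actions):
        if len(vector) != expected_dim:
            raise ValueError(
                f"step {step} has {len(vector)} values, expected {expected_dim}"
            )
```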

Licensing and Availability

The model is released under the OpenMDW1.1 license for both commercial and non-commercial use. NVIDIA published the model collection on Hugging Face and GitHub on May 31, 2025, with global deployment availability.

Pricing information has not been disclosed.

What This Means

Cosmos3-Nano represents NVIDIA's entry into omnimodal foundation models designed specifically for embodied AI. The 256K context window for reasoning tasks and native support for action trajectory generation distinguish it from general-purpose multimodal models. Combining autoregressive and diffusion transformers lets the model use different generation mechanisms for discrete modalities (text) and continuous ones (video, audio, actions). With 10 supported robot embodiments at launch, NVIDIA is positioning Cosmos3 as infrastructure for physical AI development rather than as a general-purpose model, targeting robotics researchers and autonomous system developers who need unified world modeling capabilities.

Source: huggingface.co


