NVIDIA Releases Cosmos3-Nano: 16B-Parameter Omnimodal World Model for Physical AI with 256K Token Context
NVIDIA has released Cosmos3-Nano, a 16-billion parameter omnimodal world model capable of generating video, audio, images, and robot action commands from combinations of text, image, video, and action trajectory inputs. The model supports a 256K token context window and is designed for Physical AI applications including robotics, autonomous vehicles, and smart manufacturing environments.
Cosmos3-Nano is part of the Cosmos3 collection, and its 256K token context window is intended for reasoning tasks.
Technical Specifications
Cosmos3-Nano uses a Mixture-of-Transformers (MoT) architecture that combines an autoregressive transformer for discrete token generation with a diffusion transformer for continuous multimodal generation. The model accepts inputs in five modalities: text, images, video, audio, and action trajectories. Video input may be supplied with or without audio.
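The split between the two transformers can be pictured as a dispatch table. The sketch below is a toy illustration only, not the model's actual routing: the modality names, the `Generator` enum, and the assumption that text is the only discrete modality are all hypothetical choices made for this example.

```python
from enum import Enum


class Generator(Enum):
    AUTOREGRESSIVE = "autoregressive transformer"
    DIFFUSION = "diffusion transformer"


# Assumption for illustration: text is treated as the discrete modality,
# everything else as continuous. The real model may partition differently.
DISCRETE_MODALITIES = {"text"}
CONTINUOUS_MODALITIES = {"image", "video", "audio", "action"}


def generator_for(modality: str) -> Generator:
    """Pick the generation mechanism for a given output modality."""
    if modality in DISCRETE_MODALITIES:
        return Generator.AUTOREGRESSIVE
    if modality in CONTINUOUS_MODALITIES:
        return Generator.DIFFUSION
    raise ValueError(f"unknown modality: {modality!r}")


for m in ("text", "video", "action"):
    print(f"{m}: {generator_for(m).value}")
```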
Input specifications:
- Text: up to 256K tokens
- Images: 256p, 480p, and 720p at aspect ratios 16:9, 4:3, 1:1, 3:4, 9:16
- Video: same resolutions and aspect ratios as images, maximum 5 input frames
- Audio: 48 kHz stereo, maximum 0.5 seconds
- Action trajectories: 16-400 frames
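The limits above can be checked before a request is sent to the model. The following is a minimal sketch of such a pre-flight check. `InputRequest` and `validate_input` are hypothetical names invented for this example and are not part of any NVIDIA API, and reading "256K" as 262,144 tokens is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Limits copied from the input specifications listed above.
MAX_TEXT_TOKENS = 256 * 1024  # assumption: "256K" read as 262,144 tokens
ALLOWED_RESOLUTIONS = {"256p", "480p", "720p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "4:3", "1:1", "3:4", "9:16"}
MAX_INPUT_VIDEO_FRAMES = 5
MAX_INPUT_AUDIO_SECONDS = 0.5
ACTION_FRAMES_RANGE = (16, 400)


@dataclass
class InputRequest:
    """Hypothetical request container used only for this sketch."""
    text_tokens: int = 0
    resolution: Optional[str] = None
    aspect_ratio: Optional[str] = None
    video_frames: int = 0
    audio_seconds: float = 0.0
    action_frames: Optional[int] = None


def validate_input(req: InputRequest) -> list[str]:
    """Return a list of violations; an empty list means the request fits."""
    errors = []
    if req.text_tokens > MAX_TEXT_TOKENS:
        errors.append(f"text exceeds {MAX_TEXT_TOKENS} tokens")
    if req.resolution is not None and req.resolution not in ALLOWED_RESOLUTIONS:
        errors.append(f"unsupported resolution {req.resolution}")
    if req.aspect_ratio is not None and req.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        errors.append(f"unsupported aspect ratio {req.aspect_ratio}")
    if req.video_frames > MAX_INPUT_VIDEO_FRAMES:
        errors.append(f"video input exceeds {MAX_INPUT_VIDEO_FRAMES} frames")
    if req.audio_seconds > MAX_INPUT_AUDIO_SECONDS:
        errors.append(f"audio input exceeds {MAX_INPUT_AUDIO_SECONDS} seconds")
    if req.action_frames is not None:
        lo, hi = ACTION_FRAMES_RANGE
        if not lo <= req.action_frames <= hi:
            errors.append(f"action trajectory must have {lo}-{hi} frames")
    return errors


print(validate_input(InputRequest(resolution="1080p", video_frames=6)))
```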
Generated video can run from 5 to 400 frames (189 frames by default), with audio encoded in AAC format at 48 kHz stereo. Outputs are produced at the same resolutions listed in the input specifications.
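As a small illustration of the output frame limits, the sketch below resolves a requested frame count against the stated range and default. The function name `resolve_output_frames` is hypothetical, and how the real runtimes expose this setting is not described in the release.

```python
from typing import Optional

# Values taken from the output description above.
MIN_OUTPUT_FRAMES = 5
MAX_OUTPUT_FRAMES = 400
DEFAULT_OUTPUT_FRAMES = 189


def resolve_output_frames(requested: Optional[int] = None) -> int:
    """Return the frame count to generate, falling back to the default."""
    if requested is None:
        return DEFAULT_OUTPUT_FRAMES
    if not MIN_OUTPUT_FRAMES <= requested <= MAX_OUTPUT_FRAMES:
        raise ValueError(
            f"frame count must be {MIN_OUTPUT_FRAMES}-{MAX_OUTPUT_FRAMES}, "
            f"got {requested}"
        )
    return requested


print(resolve_output_frames())     # no request: the default is used
print(resolve_output_frames(120))  # in range: returned unchanged
```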
Model Variants
NVIDIA released multiple Cosmos3 variants simultaneously:
- Cosmos3-Nano: 16B parameters for general omnimodal tasks
- Cosmos3-Super: 64B parameters for enhanced performance
- Cosmos3-Nano-Policy-DROID: 16B parameters specialized for robot manipulation
- Cosmos3-Super-Image2Video: 64B parameters for image-to-video generation
- Cosmos3-Super-Text2Image: 64B parameters for text-to-image generation
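For readers comparing the variants programmatically, the list above can be captured as a small lookup table. This is a sketch only: the `Variant` class and `variants_within` helper are hypothetical, and the names are the ones given in this article, which may not match the exact repository identifiers on Hugging Face.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str            # name as listed in the article, not a verified repo ID
    params_billion: int  # parameter count in billions
    purpose: str


VARIANTS = [
    Variant("Cosmos3-Nano", 16, "general omnimodal tasks"),
    Variant("Cosmos3-Super", 64, "enhanced performance"),
    Variant("Cosmos3-Nano-Policy-DROID", 16, "robot manipulation"),
    Variant("Cosmos3-Super-Image2Video", 64, "image-to-video generation"),
    Variant("Cosmos3-Super-Text2Image", 64, "text-to-image generation"),
]


def variants_within(max_params_billion: int) -> list[str]:
    """Names of the variants at or below a parameter budget."""
    return [v.name for v in VARIANTS if v.params_billion <= max_params_billion]


print(variants_within(16))
```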
Hardware and Deployment
The models require NVIDIA GPUs based on the Ampere, Hopper, or Blackwell architecture. Only BF16 precision is officially supported. Runtime integration is available through PyTorch, vLLM-Omni, and Hugging Face Diffusers.
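Because BF16 stores each parameter in 2 bytes, the parameter counts give a rough lower bound on GPU memory. The sketch below computes the weight footprint only. It deliberately ignores activations, caches, and any framework overhead, so real requirements will be higher, and the helper name is an illustrative choice.

```python
BYTES_PER_BF16_PARAM = 2  # bfloat16 stores each value in 16 bits


def bf16_weight_gib(params_billion: float) -> float:
    """Approximate size of the weights alone, in GiB, at BF16 precision."""
    total_bytes = params_billion * 1e9 * BYTES_PER_BF16_PARAM
    return total_bytes / 2**30


for name, size in [("Cosmos3-Nano", 16), ("Cosmos3-Super", 64)]:
    print(f"{name}: ~{bf16_weight_gib(size):.1f} GiB of weights")
```

By this estimate the 16B model needs roughly 30 GiB for weights and the 64B model roughly 119 GiB, before any runtime memory is counted.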
Cosmos3-Nano supports robot action generation for 10 different embodiments, including Franka Panda arms, WidowX 250, and various industrial robots. Action outputs are embodiment-specific, ranging from 9D to 57D depending on the robot platform.
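Since action outputs vary in width by embodiment, downstream code would typically verify the shape of a generated action sequence before sending it to a controller. The sketch below shows one way to do that. The function `check_action_chunk` is hypothetical, and the per-robot dimension must be supplied by the caller because the release summary gives only the overall 9D to 57D range, not a per-embodiment table.

```python
# Overall range of action dimensions stated for the supported embodiments.
ACTION_DIM_RANGE = (9, 57)


def check_action_chunk(actions: list[list[float]], expected_dim: int) -> None:
    """Raise ValueError if any step does not match the embodiment's width."""
    lo, hi = ACTION_DIM_RANGE
    if not lo <= expected_dim <= hi:
        raise ValueError(f"expected_dim {expected_dim} outside {lo}-{hi}")
    for i, step in enumerate(actions):
        if len(step) != expected_dim:
            raise ValueError(
                f"step {i} has {len(step)} values, expected {expected_dim}"
            )


# Usage with a placeholder 9-dimensional, 4-step action sequence.
dummy = [[0.0] * 9 for _ in range(4)]
check_action_chunk(dummy, expected_dim=9)
print("action chunk OK")
```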
Licensing and Availability
The model is released under the OpenMDW1.1 license for both commercial and non-commercial use. NVIDIA published the model collection on Hugging Face and GitHub on May 31, 2025, and the models are available for deployment globally.
Pricing information has not been disclosed.
What This Means
Cosmos3-Nano represents NVIDIA's entry into omnimodal foundation models designed specifically for embodied AI. The 256K context window for reasoning tasks and native support for action trajectory generation distinguish it from general-purpose multimodal models. Combining autoregressive and diffusion transformers lets the model use different generation mechanisms for discrete (text) and continuous (video, audio, actions) modalities. With 10 supported robot embodiments at launch, NVIDIA is positioning Cosmos3 as infrastructure for physical AI development rather than as a general-purpose model, targeting robotics researchers and autonomous system developers who need unified world modeling capabilities.