NVIDIA Releases Cosmos 3: 64B-Parameter Omnimodal World Model for Physical AI
NVIDIA released Cosmos 3, an omnimodal world foundation model platform for Physical AI spanning robotics, autonomous driving, and industrial environments. The flagship Cosmos3-Super variant contains 64 billion parameters and generates video, images, audio, and action commands from text, image, video, and action trajectory inputs using a Mixture-of-Transformers architecture.
The Cosmos 3 model collection is available on Hugging Face and GitHub as of May 31, 2026.
Model Specifications
Cosmos 3 comes in five variants with parameter counts ranging from 16B to 64B:
- Cosmos3-Nano: 16 billion parameters
- Cosmos3-Super: 64 billion parameters
- Cosmos3-Nano-Policy-DROID: 16 billion parameters (robotics-specific)
- Cosmos3-Super-Image2Video: 64 billion parameters
- Cosmos3-Super-Text2Image: 64 billion parameters
All models are released under the OpenMDW1.1 license for commercial and non-commercial use.
Technical Architecture
Cosmos 3 uses a Mixture-of-Transformers (MoT) architecture consisting of two complementary transformer towers: an autoregressive transformer for discrete token generation and a diffusion transformer for continuous multimodal generation. Text generates through standard next-token autoregressive decoding, while non-text modalities synthesize through iterative denoising.
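The contrast between the two generation modes can be illustrated with a toy sketch. This is not Cosmos 3 code: the bigram table, the `toy_denoiser` function, and the step schedule are all hypothetical stand-ins chosen only to show the shape of each loop. A real autoregressive tower would score a full vocabulary with a learned network, and a real diffusion tower would use a learned denoiser conditioned on the prompt.

```python
import random

# Toy stand-in for an autoregressive tower: each token maps to one successor.
# Hypothetical data, for illustration only.
BIGRAM = {"<s>": "the", "the": "robot", "robot": "moves", "moves": "</s>"}


def decode_text(max_tokens=10):
    """Greedy next-token decoding: every new token depends on the previous one."""
    tokens = ["<s>"]
    while len(tokens) < max_tokens and tokens[-1] != "</s>":
        tokens.append(BIGRAM[tokens[-1]])
    # Drop the start marker, and the end marker if one was produced.
    return tokens[1:-1] if tokens[-1] == "</s>" else tokens[1:]


def toy_denoiser(x, target):
    """Hypothetical denoiser: 'predicts' the clean sample (here a fixed target)."""
    return target


def denoise(target, steps=20, seed=0):
    """Iterative denoising: start from noise and refine the whole sample each step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure Gaussian noise
    for i in range(steps):
        pred = toy_denoiser(x, target)
        alpha = 1.0 / (steps - i)  # fraction of the remaining gap closed this step
        x = [xi + alpha * (pi - xi) for xi, pi in zip(x, pred)]
    return x


print(decode_text())
print(denoise([1.0, -2.0, 0.5]))
```

The first loop emits one discrete token at a time, while the second updates every element of a continuous sample on each pass, which is the distinction the architecture description draws between text and non-text outputs.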
Input and Output Capabilities
The models accept multimodal inputs including:
- Text: Context window of up to 256K tokens for reasoning tasks
- Images: 256p, 480p, and 720p at aspect ratios 16:9, 4:3, 1:1, 3:4, 9:16
- Video: Up to 5 input frames at the same resolutions
- Audio: 48 kHz stereo, up to 0.5 seconds in duration
- Action trajectories: Compatible with 10 robot embodiments including Franka Panda, UR, Google robot, and WidowX 250
Outputs include video generation from 5 to 400 frames (default 189 frames), images in JPEG format, 48 kHz stereo AAC audio, and robot action sequences.
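The limits listed above could be checked before submitting a generation request. The following is a minimal sketch, assuming a hypothetical `VideoRequest` structure: the class, its field names, and the default resolution and aspect ratio are illustrative assumptions and do not come from any NVIDIA API. Only the numeric limits and allowed values are taken from the article.

```python
from dataclasses import dataclass

# Allowed values and limits as stated in the article.
RESOLUTIONS = {"256p", "480p", "720p"}
ASPECT_RATIOS = {"16:9", "4:3", "1:1", "3:4", "9:16"}
MAX_INPUT_FRAMES = 5
MIN_OUTPUT_FRAMES = 5
MAX_OUTPUT_FRAMES = 400
DEFAULT_OUTPUT_FRAMES = 189


@dataclass
class VideoRequest:
    """Hypothetical request description; not part of any published interface."""
    resolution: str = "720p"        # assumed default, not stated in the article
    aspect_ratio: str = "16:9"      # assumed default, not stated in the article
    input_frames: int = 1
    output_frames: int = DEFAULT_OUTPUT_FRAMES

    def problems(self):
        """Return a list of violated limits (empty when the request is valid)."""
        issues = []
        if self.resolution not in RESOLUTIONS:
            issues.append(f"unsupported resolution: {self.resolution}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            issues.append(f"unsupported aspect ratio: {self.aspect_ratio}")
        if self.input_frames > MAX_INPUT_FRAMES:
            issues.append(f"too many input frames: {self.input_frames}")
        if not MIN_OUTPUT_FRAMES <= self.output_frames <= MAX_OUTPUT_FRAMES:
            issues.append(f"output frames out of range: {self.output_frames}")
        return issues


print(VideoRequest().problems())
print(VideoRequest(resolution="1080p", output_frames=500).problems())
```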
Robot Platform Support
Cosmos 3 supports action generation for specific robot platforms with dimensionality ranging from 9D (camera motion, UMI) to 57D (egocentric motion). Compatible embodiments include single and dual Franka Panda arms with RobotiQ grippers (10D and 20D), Agibot (29D), and autonomous vehicle control (9D).
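Because each embodiment expects a fixed action dimensionality, a simple lookup can catch mismatched trajectories early. The sketch below is illustrative only: the dictionary keys and the `check_trajectory` helper are hypothetical names, and it assumes that the 10D and 20D figures correspond to the single-arm and dual-arm Franka Panda setups respectively.

```python
# Action dimensionalities as listed in the article; keys are hypothetical labels.
ACTION_DIMS = {
    "camera_motion": 9,
    "umi": 9,
    "autonomous_vehicle": 9,
    "franka_single_robotiq": 10,  # assumed: single arm is the 10D case
    "franka_dual_robotiq": 20,    # assumed: dual arm is the 20D case
    "agibot": 29,
    "egocentric_motion": 57,
}


def check_trajectory(embodiment, trajectory):
    """Validate that every step has the expected dimension; return the step count."""
    expected = ACTION_DIMS[embodiment]
    for t, step in enumerate(trajectory):
        if len(step) != expected:
            raise ValueError(
                f"step {t}: got {len(step)}D action, expected {expected}D "
                f"for {embodiment}"
            )
    return len(trajectory)


# Three steps of a 29D Agibot trajectory (all zeros, for illustration).
print(check_trajectory("agibot", [[0.0] * 29 for _ in range(3)]))
```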
Hardware Requirements
According to NVIDIA, the models are optimized for NVIDIA Ampere, Hopper, and Blackwell GPU architectures running Linux. Only BF16 precision is officially supported and tested. The models integrate with PyTorch, vLLM-Omni, and Hugging Face Diffusers runtimes.
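Since BF16 stores each parameter in 2 bytes, the stated parameter counts give a rough lower bound on the memory needed just to hold the weights. The calculation below is a back-of-the-envelope sketch, with `weight_gb` as a hypothetical helper name. It covers weights only and ignores activations, caches, and runtime overhead, so actual requirements will be higher.

```python
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit format


def weight_gb(params_billion):
    """Approximate weight storage in GB (10^9 bytes) at BF16, weights only."""
    return params_billion * 1e9 * BYTES_PER_BF16 / 1e9


for name, size in [("Cosmos3-Nano", 16), ("Cosmos3-Super", 64)]:
    print(f"{name}: ~{weight_gb(size):.0f} GB of weights")
# Cosmos3-Nano: ~32 GB of weights
# Cosmos3-Super: ~128 GB of weights
```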
What This Means
Cosmos 3 represents NVIDIA's entry into omnimodal world models that bring vision, language, audio, and robotic control together within a unified architecture. The 256K token context window for reasoning and the support for 10 robot embodiments position it for physical AI applications requiring long-horizon planning. By separating discrete token generation from continuous generation, the dual-transformer architecture lets text use autoregressive decoding while video, images, and audio use iterative denoising, rather than forcing every modality through a single decoding method. Pricing information has not been disclosed.