NVIDIA Releases Cosmos3-Super: 64B-Parameter Omnimodal World Model for Physical AI
NVIDIA released Cosmos3-Super, a 64-billion-parameter omnimodal foundation model that generates video, images, audio, and action commands from combinations of text, image, video, and action trajectory inputs. The model, part of the Cosmos3 collection, targets Physical AI applications including robotics, autonomous vehicles, and industrial automation.
Model Architecture and Specifications
Cosmos3-Super uses a Mixture-of-Transformers (MoT) architecture combining an autoregressive transformer for text generation with a diffusion transformer for continuous multimodal outputs. The model processes text, images, video (with or without audio), and action trajectories as inputs.
The model supports context windows up to 256K tokens for reasoning tasks and accepts video inputs at resolutions up to 720p across multiple aspect ratios (16:9, 4:3, 1:1, 3:4, 9:16). Video generation handles 5 to 400 frames, with 189 frames as the default duration.
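The input and output limits quoted above can be expressed as a small pre-flight check. This is a minimal sketch, not part of any NVIDIA API: the function names `is_supported_aspect_ratio` and `resolve_frame_count` are hypothetical, and only the aspect ratios and frame counts come from the article.

```python
from math import gcd
from typing import Optional

# Limits quoted in the article; all names below are hypothetical.
INPUT_ASPECT_RATIOS = {"16:9", "4:3", "1:1", "3:4", "9:16"}
MIN_FRAMES, MAX_FRAMES, DEFAULT_FRAMES = 5, 400, 189


def is_supported_aspect_ratio(width: int, height: int) -> bool:
    """Reduce width:height to lowest terms and compare with the listed ratios."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    divisor = gcd(width, height)
    return f"{width // divisor}:{height // divisor}" in INPUT_ASPECT_RATIOS


def resolve_frame_count(requested: Optional[int] = None) -> int:
    """Fall back to the default frame count, or reject out-of-range requests."""
    if requested is None:
        return DEFAULT_FRAMES
    if not MIN_FRAMES <= requested <= MAX_FRAMES:
        raise ValueError(
            f"frame count must be between {MIN_FRAMES} and {MAX_FRAMES}, "
            f"got {requested}"
        )
    return requested
```

Checking these values before a request reaches the model keeps failures cheap; the actual runtimes may enforce the same limits differently.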
Model Collection
NVIDIA released five variants:
- Cosmos3-Nano: 16B parameters for multimodal understanding and generation
- Cosmos3-Super: 64B parameters for advanced world simulation
- Cosmos3-Nano-Policy-DROID: 16B parameters fine-tuned for the DROID robot platform
- Cosmos3-Super-Image2Video: 64B parameters specialized for image-to-video generation
- Cosmos3-Super-Text2Image: 64B parameters for text-to-image synthesis
Technical Capabilities
The model supports multiple robot embodiments including Franka Panda arms, Agibot, UR robots, Google robots, WidowX 250, and UMI platforms. Action outputs are dimensioned for specific embodiments, ranging from 9D for camera motion to 57D for egocentric motion.
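Because the action vector length depends on the target, downstream code would likely want to check the size before sending commands to a controller. The following is a minimal sketch under stated assumptions: the keys `camera_motion` and `egocentric_motion` are hypothetical labels, and only the two dimensions quoted above (9 and 57) come from the article. Sizes for the other embodiments are not given and are left out.

```python
# Only these two sizes appear in the article; the keys are hypothetical labels.
ACTION_DIMS = {
    "camera_motion": 9,
    "egocentric_motion": 57,
}


def check_action(target: str, action: list[float]) -> list[float]:
    """Return the action unchanged if its length matches the target's size."""
    if target not in ACTION_DIMS:
        raise KeyError(f"no known action size for target: {target}")
    expected = ACTION_DIMS[target]
    if len(action) != expected:
        raise ValueError(
            f"{target} expects {expected} values, got {len(action)}"
        )
    return action
```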
Audio processing operates at 48 kHz stereo with AAC encoding. For reasoning tasks, video inputs perform best when sampled at 4 fps.
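Most source footage is recorded at a higher frame rate than 4 fps, so it would need to be subsampled first. Here is a minimal sketch of one way to pick which frames to keep, assuming simple evenly spaced selection; the function `sample_frame_indices` is hypothetical and the article does not describe how NVIDIA's own pipeline samples frames.

```python
def sample_frame_indices(
    num_frames: int, source_fps: float, target_fps: float = 4.0
) -> list[int]:
    """Pick source frame indices at evenly spaced target-rate timestamps."""
    if num_frames <= 0 or source_fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames and frame rates must be positive")
    # Number of samples that fit in the clip at the target rate (at least one).
    num_samples = max(1, int(num_frames * target_fps / source_fps))
    # Distance between kept frames, measured in source frames.
    step = source_fps / target_fps
    # Floor each position and clamp so the index never runs past the clip.
    return [min(num_frames - 1, int(i * step)) for i in range(num_samples)]
```

For a 4-second clip at 30 fps (120 frames), this keeps 16 frames spaced 7 or 8 source frames apart. If the source rate is below the target rate, indices repeat rather than interpolate.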
Availability and Requirements
The models are available on Hugging Face and GitHub under the OpenMDW1.1 license for commercial and non-commercial use. NVIDIA states the models are optimized for NVIDIA Ampere, Hopper, and Blackwell architectures running Linux. Only BF16 precision is officially tested and supported.
Supported runtimes include PyTorch, vLLM-Omni, and Hugging Face Diffusers.
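Since BF16 stores each parameter in 2 bytes, the nominal parameter counts give a rough lower bound on the memory needed just to hold the weights. The sketch below is back-of-the-envelope arithmetic, not a published requirement: it ignores activations, caches, and runtime overhead, and it treats "16B" and "64B" as exact counts.

```python
BYTES_PER_BF16_PARAM = 2  # BF16 is a 16-bit format


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts from the article.
for name, params in {"Cosmos3-Nano": 16e9, "Cosmos3-Super": 64e9}.items():
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights in BF16")
```

By this estimate the 64B variants need roughly 128 GB for weights alone and the 16B variants roughly 32 GB, which suggests the larger models will typically be spread across several GPUs.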
What This Means
Cosmos3-Super represents NVIDIA's entry into world models for embodied AI, directly competing with approaches from companies like OpenAI and Google DeepMind in the Physical AI space. Its 256K token context window and native action trajectory generation distinguish it from vision-language models that lack embodied AI capabilities. The release of a variant fine-tuned for a specific platform (DROID) suggests NVIDIA is positioning Cosmos as both a research foundation and a commercial robotics development tool. Pricing has not been disclosed, and cost will be critical for adoption given the model's 64B parameter scale and GPU requirements.