NVIDIA Releases Cosmos 3: 64B-Parameter Omnimodal World Model for Physical AI

TL;DR

NVIDIA released Cosmos 3, an omnimodal world foundation model platform for Physical AI spanning robotics, autonomous driving, and industrial environments. The flagship Cosmos3-Super variant contains 64 billion parameters and generates video, images, audio, and action commands from text, image, video, and action trajectory inputs using a Mixture-of-Transformers architecture.

June 2, 2026 · 8:50 AM · 2 min read

Cosmos 3 Super Image2Video — Quick Specs

Context window: 262K tokens

NVIDIA released Cosmos 3, an omnimodal world foundation model platform designed to accelerate Physical AI development across robotics, autonomous vehicles, and industrial environments. The model collection is available on Hugging Face and GitHub as of May 31, 2026.

Model Specifications

Cosmos 3 comes in five variants at two sizes, 16B and 64B parameters:

Cosmos3-Nano: 16 billion parameters
Cosmos3-Super: 64 billion parameters
Cosmos3-Nano-Policy-DROID: 16 billion parameters (robotics-specific)
Cosmos3-Super-Image2Video: 64 billion parameters
Cosmos3-Super-Text2Image: 64 billion parameters

All models are released under the OpenMDW1.1 license for commercial and non-commercial use.

Technical Architecture

Cosmos 3 uses a Mixture-of-Transformers (MoT) architecture consisting of two complementary transformer towers: an autoregressive transformer for discrete token generation and a diffusion transformer for continuous multimodal generation. Text is generated through standard next-token autoregressive decoding, while non-text modalities are synthesized through iterative denoising.
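
The two generation paths described above can be illustrated with a toy sketch. This is not NVIDIA's implementation: the function names, the `TOWER_FOR_MODALITY` table, and the stand-in callables are all hypothetical, and real towers would be neural networks rather than simple functions. The sketch only shows the control flow of next-token decoding versus iterative denoising.

```python
from typing import Callable, Dict, List

# Hypothetical routing table following the article's description:
# text uses the autoregressive tower, non-text modalities use diffusion.
TOWER_FOR_MODALITY: Dict[str, str] = {
    "text": "autoregressive",
    "image": "diffusion",
    "video": "diffusion",
    "audio": "diffusion",
    "action": "diffusion",
}


def autoregressive_decode(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Append one discrete token at a time until EOS or the token budget."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # stand-in for a model forward pass
        tokens.append(token)
        if token == eos_id:
            break
    return tokens


def iterative_denoise(
    denoise_step: Callable[[List[float], int], List[float]],
    noisy_sample: List[float],
    num_steps: int,
) -> List[float]:
    """Refine a continuous sample over a fixed number of denoising steps."""
    sample = list(noisy_sample)
    for step in reversed(range(num_steps)):
        sample = denoise_step(sample, step)  # stand-in for a model call
    return sample


if __name__ == "__main__":
    # Toy stand-ins: count upward for text, halve values for denoising.
    print(autoregressive_decode(lambda t: t[-1] + 1, [1, 2], 10, eos_id=5))
    print(iterative_denoise(lambda x, s: [v * 0.5 for v in x], [8.0], 3))
```

The key contrast is that the first loop grows a sequence of discrete tokens, while the second repeatedly rewrites a fixed-size continuous sample.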

Input and Output Capabilities

The models accept multimodal inputs including:

Text: Context window of up to 256K tokens for reasoning tasks
Images: 256p, 480p, and 720p at aspect ratios 16:9, 4:3, 1:1, 3:4, 9:16
Video: Up to 5 input frames at the same resolutions
Audio: 48 kHz stereo, up to 0.5 seconds in duration
Action trajectories: Compatible with 10 robot embodiments including Franka Panda, UR, Google robot, and WidowX 250
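
The input limits listed above can be written down as a small pre-flight check. The following is a minimal sketch under stated assumptions: the constant names, the `VisualInput` class, and `validate_visual` are hypothetical and not part of any Cosmos 3 API. It also assumes "256K" means 256 × 1024 tokens, which would match the 262K figure shown in the Quick Specs.

```python
from dataclasses import dataclass
from typing import List

# Limits taken from the article; all names here are hypothetical.
SUPPORTED_RESOLUTIONS = {"256p", "480p", "720p"}
SUPPORTED_ASPECT_RATIOS = {"16:9", "4:3", "1:1", "3:4", "9:16"}
MAX_INPUT_FRAMES = 5
AUDIO_SAMPLE_RATE_HZ = 48_000
MAX_AUDIO_SECONDS = 0.5
MAX_TEXT_TOKENS = 256 * 1024  # assumption: K = 1024, i.e. 262,144 tokens

# Longest accepted audio clip, in samples per channel.
MAX_AUDIO_SAMPLES = int(AUDIO_SAMPLE_RATE_HZ * MAX_AUDIO_SECONDS)


@dataclass
class VisualInput:
    resolution: str      # e.g. "480p"
    aspect_ratio: str    # e.g. "16:9"
    num_frames: int = 1  # 1 for a still image


def validate_visual(item: VisualInput) -> List[str]:
    """Return a list of human-readable problems; empty means acceptable."""
    problems = []
    if item.resolution not in SUPPORTED_RESOLUTIONS:
        problems.append(f"unsupported resolution: {item.resolution}")
    if item.aspect_ratio not in SUPPORTED_ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {item.aspect_ratio}")
    if not 1 <= item.num_frames <= MAX_INPUT_FRAMES:
        problems.append(f"frame count out of range: {item.num_frames}")
    return problems
```

For example, `validate_visual(VisualInput("480p", "16:9", 5))` returns an empty list, while a 1080p clip with six frames would be reported with two problems.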

Outputs include video generation from 5 to 400 frames (default 189 frames), images in JPEG format, 48 kHz stereo AAC audio, and robot action sequences.
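
As a small illustration of the video output range, a caller could resolve the requested frame count against the stated bounds before submitting a job. This is a sketch only; `resolve_frame_count` and the constant names are hypothetical, and the actual runtimes may handle out-of-range values differently.

```python
from typing import Optional

# Bounds and default as stated in the article.
MIN_OUTPUT_FRAMES = 5
MAX_OUTPUT_FRAMES = 400
DEFAULT_OUTPUT_FRAMES = 189


def resolve_frame_count(requested: Optional[int] = None) -> int:
    """Return the frame count to generate, falling back to the default."""
    if requested is None:
        return DEFAULT_OUTPUT_FRAMES
    if not MIN_OUTPUT_FRAMES <= requested <= MAX_OUTPUT_FRAMES:
        raise ValueError(
            f"frame count {requested} is outside "
            f"{MIN_OUTPUT_FRAMES}-{MAX_OUTPUT_FRAMES}"
        )
    return requested
```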

Robot Platform Support

Cosmos 3 supports action generation for specific robot platforms with dimensionality ranging from 9D (camera motion, UMI) to 57D (egocentric motion). Compatible embodiments include single and dual Franka Panda arms with RobotiQ grippers (10D and 20D), Agibot (29D), and autonomous vehicle control (9D).
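
The action dimensionalities above lend themselves to a lookup table that guards against sending a vector of the wrong length. The sketch below is illustrative: the embodiment keys and `check_action` are hypothetical names, and it assumes that 10D and 20D correspond to the single-arm and dual-arm Franka Panda setups respectively.

```python
from typing import Dict, Sequence

# Action vector sizes from the article; keys are hypothetical identifiers.
ACTION_DIMS: Dict[str, int] = {
    "camera_motion": 9,
    "umi": 9,
    "autonomous_vehicle": 9,
    "franka_single_robotiq": 10,  # assumed: single arm is the 10D case
    "franka_dual_robotiq": 20,    # assumed: dual arm is the 20D case
    "agibot": 29,
    "egocentric_motion": 57,
}


def check_action(embodiment: str, action: Sequence[float]) -> None:
    """Raise if the embodiment is unknown or the vector length is wrong."""
    if embodiment not in ACTION_DIMS:
        raise KeyError(f"unknown embodiment: {embodiment}")
    expected = ACTION_DIMS[embodiment]
    if len(action) != expected:
        raise ValueError(
            f"{embodiment} expects {expected} values, got {len(action)}"
        )
```

A call such as `check_action("agibot", [0.0] * 29)` passes silently, while a 10-element vector for the same embodiment raises `ValueError`.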

Hardware Requirements

According to NVIDIA, the models are optimized for NVIDIA Ampere, Hopper, and Blackwell GPU architectures running Linux. Only BF16 precision is officially supported and tested. The models integrate with PyTorch, vLLM-Omni, and Hugging Face Diffusers runtimes.
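
Because BF16 stores each parameter in 2 bytes, the parameter counts give a rough lower bound on the memory needed just to hold the weights. The following back-of-the-envelope sketch uses hypothetical names (`VARIANT_PARAMS`, `bf16_weight_gb`) and deliberately ignores activations, caches, and runtime overhead, so real requirements will be higher.

```python
from typing import Dict

BYTES_PER_BF16_PARAM = 2  # bfloat16 is a 16-bit format

# Parameter counts as listed in the article.
VARIANT_PARAMS: Dict[str, int] = {
    "Cosmos3-Nano": 16_000_000_000,
    "Cosmos3-Super": 64_000_000_000,
    "Cosmos3-Nano-Policy-DROID": 16_000_000_000,
    "Cosmos3-Super-Image2Video": 64_000_000_000,
    "Cosmos3-Super-Text2Image": 64_000_000_000,
}


def bf16_weight_gb(num_params: int) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_BF16_PARAM / 1e9


if __name__ == "__main__":
    for name, params in VARIANT_PARAMS.items():
        print(f"{name}: about {bf16_weight_gb(params):.0f} GB of weights")
```

By this estimate the 16B variants need about 32 GB for weights alone and the 64B variants about 128 GB, which suggests the larger models would typically span multiple GPUs.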

What This Means

Cosmos 3 represents NVIDIA's entry into omnimodal world models that bridge multiple AI modalities (vision, language, audio, and robotic control) within a unified architecture. The 256K-token context window for reasoning and support for 10 robot embodiments position it for physical AI applications requiring long-horizon planning. The dual-transformer architecture separates discrete and continuous generation, which addresses a core challenge in multimodal modeling. Pricing information has not been disclosed.

Source: huggingface.co
