
NVIDIA Releases Cosmos3-Super: 64B-Parameter Omnimodal World Model for Physical AI

TL;DR

NVIDIA released Cosmos3-Super, a 64-billion-parameter omnimodal foundation model that generates video, images, audio, and action commands from combinations of text, image, video, and action-trajectory inputs. The model, part of the Cosmos3 collection, targets Physical AI applications including robotics, autonomous vehicles, and industrial automation.

June 2, 2026 · 6:50 AM · 2 min read

NVIDIA Cosmos3-Super: Quick Specs

Context window: 256K tokens


NVIDIA released Cosmos3-Super, a 64-billion-parameter omnimodal foundation model designed for Physical AI applications across robotics, autonomous driving, and industrial environments. The model generates video, images, audio, and robot action commands from multimodal inputs.

Model Architecture and Specifications

Cosmos3-Super uses a Mixture-of-Transformers (MoT) architecture combining an autoregressive transformer for text generation with a diffusion transformer for continuous multimodal outputs. The model processes text, images, video (with or without audio), and action trajectories as inputs.
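As a rough illustration of that split, the sketch below routes each requested output modality to one of two decoder branches. It is a toy dispatcher under stated assumptions, not NVIDIA's implementation: the names Branch and route_output are hypothetical, and treating image, video, audio, and action outputs as the "continuous" outputs handled by the diffusion branch is an inference from the description above.

```python
from enum import Enum


class Branch(Enum):
    """The two decoder branches named in the article."""
    AUTOREGRESSIVE = "autoregressive transformer"
    DIFFUSION = "diffusion transformer"


# Assumption: text is the only discrete output; every other modality is
# treated as a continuous output and sent to the diffusion branch.
CONTINUOUS_OUTPUTS = {"image", "video", "audio", "action"}


def route_output(modality: str) -> Branch:
    """Pick the decoder branch for one requested output modality."""
    name = modality.lower()
    if name == "text":
        return Branch.AUTOREGRESSIVE
    if name in CONTINUOUS_OUTPUTS:
        return Branch.DIFFUSION
    raise ValueError(f"unsupported output modality: {modality!r}")
```

In a real Mixture-of-Transformers model the branches share context inside one network rather than being picked by a lookup; the dispatcher only shows which kind of output each branch is described as producing.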

The model supports context windows of up to 256K tokens for reasoning tasks and accepts video inputs at resolutions up to 720p in five aspect ratios (16:9, 4:3, 1:1, 3:4, 9:16). Video generation handles clips of 5 to 400 frames, with 189 frames as the default length.
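The limits quoted above can be captured in a small pre-flight check, for example before a request is sent to an inference server. This is a minimal sketch: the function names are hypothetical, the constants come straight from the figures in this article, and the 720p cap is not enforced because the exact pixel limits per aspect ratio are not stated.

```python
from typing import Optional

# Figures quoted in the article.
INPUT_ASPECT_RATIOS = ("16:9", "4:3", "1:1", "3:4", "9:16")
MIN_FRAMES, MAX_FRAMES, DEFAULT_FRAMES = 5, 400, 189


def check_input_aspect_ratio(width: int, height: int) -> str:
    """Return the listed aspect-ratio label matching width x height."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    for label in INPUT_ASPECT_RATIOS:
        w, h = (int(part) for part in label.split(":"))
        # Cross-multiply so the comparison stays in exact integers.
        if width * h == height * w:
            return label
    raise ValueError(f"{width}x{height} matches none of {INPUT_ASPECT_RATIOS}")


def resolve_num_frames(num_frames: Optional[int] = None) -> int:
    """Fall back to the default length, or validate an explicit frame count."""
    if num_frames is None:
        return DEFAULT_FRAMES
    if not MIN_FRAMES <= num_frames <= MAX_FRAMES:
        raise ValueError(f"num_frames must be in [{MIN_FRAMES}, {MAX_FRAMES}]")
    return num_frames
```

For instance, check_input_aspect_ratio(1280, 720) returns "16:9", and resolve_num_frames() returns 189.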

Model Collection

NVIDIA released five variants:

Cosmos3-Nano: 16B parameters for multimodal understanding and generation
Cosmos3-Super: 64B parameters for advanced world simulation
Cosmos3-Nano-Policy-DROID: 16B parameters fine-tuned for the DROID robot platform
Cosmos3-Super-Image2Video: 64B parameters specialized for image-to-video generation
Cosmos3-Super-Text2Image: 64B parameters for text-to-image synthesis
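For tooling that needs to pick a checkpoint by size or purpose, the variants listed above can be held in a small lookup table. This is only a sketch: the Variant class and variants_by_size helper are hypothetical, and the names, sizes, and descriptions are copied from the list rather than from any official model registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    params_billion: int
    purpose: str


# Entries transcribed from the list in this article.
VARIANTS = (
    Variant("Cosmos3-Nano", 16, "multimodal understanding and generation"),
    Variant("Cosmos3-Super", 64, "advanced world simulation"),
    Variant("Cosmos3-Nano-Policy-DROID", 16, "fine-tuned for the DROID robot platform"),
    Variant("Cosmos3-Super-Image2Video", 64, "image-to-video generation"),
    Variant("Cosmos3-Super-Text2Image", 64, "text-to-image synthesis"),
)


def variants_by_size(params_billion: int) -> list:
    """Return the names of all variants with the given parameter count."""
    return [v.name for v in VARIANTS if v.params_billion == params_billion]
```

Calling variants_by_size(16) yields the two Nano checkpoints, which makes it clear at a glance that the DROID policy model is built on the smaller base.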

Technical Capabilities

The model supports multiple robot embodiments, including Franka Panda arms, Agibot, UR robots, Google robots, WidowX 250, and UMI platforms. The dimensionality of the action output depends on the embodiment, ranging from 9D for camera motion to 57D for egocentric motion.
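Because the action vector width changes with the embodiment, downstream code typically has to check the shape of a trajectory before sending it to a controller. The sketch below shows one way to do that. The ACTION_DIMS table and validate_trajectory function are hypothetical, and only the two widths quoted above are filled in; the widths for the named robot platforms are not given in the article and are deliberately left out.

```python
from typing import Sequence

# Only the two widths quoted in the article. Keys are illustrative labels,
# not official embodiment identifiers.
ACTION_DIMS = {"camera_motion": 9, "egocentric_motion": 57}


def validate_trajectory(kind: str, trajectory: Sequence[Sequence[float]]) -> int:
    """Check that every step has the expected width; return the step count."""
    if kind not in ACTION_DIMS:
        raise ValueError(f"no known action width for {kind!r}")
    expected = ACTION_DIMS[kind]
    for t, step in enumerate(trajectory):
        if len(step) != expected:
            raise ValueError(
                f"step {t} has {len(step)} values, expected {expected} for {kind}"
            )
    return len(trajectory)
```

A three-step camera trajectory would pass as validate_trajectory("camera_motion", [[0.0] * 9] * 3), while the same data labeled "egocentric_motion" would be rejected at step 0.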

Audio is processed as 48 kHz stereo with AAC encoding. For best reasoning performance, video inputs should be sampled at 4 fps.
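Most source footage is recorded well above 4 fps, so a preprocessing step has to pick which frames to keep. The helper below is one simple way to do it, assuming nearest-earlier-frame selection; the function name is hypothetical, and the article does not say how the official pipeline resamples video.

```python
import math


def sample_indices(n_frames: int, src_fps: float, target_fps: float = 4.0) -> list:
    """Return source-frame indices that approximate target_fps sampling."""
    if n_frames < 0 or src_fps <= 0 or target_fps <= 0:
        raise ValueError("n_frames must be >= 0 and frame rates must be > 0")
    step = src_fps / target_fps  # source frames per sampled frame
    if step <= 1:
        # Source is already at or below the target rate: keep every frame.
        return list(range(n_frames))
    indices = []
    i = 0
    while True:
        idx = math.floor(i * step)  # latest source frame at or before sample i
        if idx >= n_frames:
            break
        indices.append(idx)
        i += 1
    return indices
```

For a two-second clip at 30 fps (60 frames), sample_indices(60, 30) keeps 8 frames: indices 0, 7, 15, 22, 30, 37, 45, and 52.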

Availability and Requirements

The models are available on Hugging Face and GitHub under the OpenMDW1.1 license for commercial and non-commercial use. NVIDIA states the models are optimized for NVIDIA Ampere, Hopper, and Blackwell architectures running Linux. Only BF16 precision is officially tested and supported.

Supported runtimes include PyTorch, vLLM-Omni, and Hugging Face Diffusers.
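Since only BF16 is supported, a back-of-the-envelope estimate of weight memory follows directly from the parameter counts: BF16 stores each parameter in 2 bytes. The sketch below does that arithmetic. It covers weights only and assumes every parameter is stored in BF16; activations, caches, and runtime overhead are not included, so real requirements will be higher.

```python
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit format


def weight_footprint_gib(params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Weights-only memory in GiB for a model with the given parameter count."""
    return params * bytes_per_param / 2**30


if __name__ == "__main__":
    for name, params in (("Cosmos3-Nano", 16e9), ("Cosmos3-Super", 64e9)):
        print(f"{name}: about {weight_footprint_gib(params):.1f} GiB of weights")
```

By this estimate the 64B model needs about 128 GB (roughly 119 GiB) for weights alone and the 16B model about 32 GB (roughly 30 GiB), which suggests the larger variant would usually be split across several GPUs.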

What This Means

Cosmos3-Super extends NVIDIA's Cosmos line of world models for embodied AI and competes directly with Physical AI efforts from companies such as OpenAI and Google DeepMind. The 256K-token context window and native action trajectory generation distinguish it from vision-language models that lack embodied AI capabilities. The release of a variant fine-tuned for a specific platform (DROID) suggests NVIDIA is positioning Cosmos as both a research foundation and a commercial robotics development tool. Pricing has not been disclosed; it will be critical for adoption given the model's 64B-parameter scale and GPU requirements.

Source: huggingface.co


