NVIDIA Releases Cosmos 3: 8B and 32B Omni-Models Combining Video Generation, Reasoning, and Action in Single Architecture

TL;DR

NVIDIA has released Cosmos 3, a unified omni-model that combines world generation, physical reasoning, and action generation in a single architecture. Available in 8B (Nano) and 32B (Super) parameter versions on Hugging Face, Cosmos 3 uses a Mixture-of-Transformers architecture to process text, image, video, audio, and action modalities without switching between separate models.

June 1, 2026 · 4:51 AM

NVIDIA has released Cosmos 3, a unified omni-model that eliminates the need for multiple specialized models in physical AI applications. The model combines video generation, physical reasoning, and action prediction capabilities in a single architecture, available immediately on Hugging Face.

Two Model Sizes

Cosmos 3 ships in two versions:

Cosmos 3 Nano: 8B parameters (8B reasoner + 8B generator), optimized for workstation-grade hardware like RTX PRO 6000 GPUs
Cosmos 3 Super: 32B parameters (32B reasoner + 32B generator), designed for large-scale synthetic data generation on NVIDIA Hopper and Blackwell GPUs

Both models are available on Hugging Face at nvidia/Cosmos3-Nano and nvidia/Cosmos3-Super.
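
As a rough guide to why the two sizes target different hardware, the arithmetic below estimates the memory needed for the weights alone. It is a back-of-the-envelope sketch, not a figure from NVIDIA: it assumes bfloat16 storage (2 bytes per parameter), ignores encoders, activations, and intermediate latents, and the helper name weight_memory_gb is hypothetical. Because the article does not say whether the reasoner and generator parameters are counted together or separately, both readings are shown.

```python
# Assumption: weights stored in bfloat16, i.e. 2 bytes per parameter.
BYTES_PER_PARAM_BF16 = 2


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Memory for the weights alone, in decimal gigabytes (hypothetical helper)."""
    return num_params * bytes_per_param / 1e9


# Reading 1: the headline count is the whole model.
nano_gb = weight_memory_gb(8e9)     # 16.0
super_gb = weight_memory_gb(32e9)   # 64.0

# Reading 2: reasoner and generator each hold that many parameters.
nano_both_gb = weight_memory_gb(2 * 8e9)     # 32.0
super_both_gb = weight_memory_gb(2 * 32e9)   # 128.0

print(f"Nano:  {nano_gb:.0f} GB to {nano_both_gb:.0f} GB")
print(f"Super: {super_gb:.0f} GB to {super_both_gb:.0f} GB")
```

Actual requirements will be higher once activations and video latents are included, so these numbers are lower bounds at best.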

Unified Architecture

Previous Cosmos releases required developers to use separate models for different tasks: Cosmos Predict for world generation, Cosmos Transfer for controlled generation, Cosmos Reason for scene understanding, and Cosmos Policy for action generation. Cosmos 3 replaces this fragmented approach with a single Mixture-of-Transformers (MoT) architecture.

The model processes all modalities (text, image, video, audio, and action) within a unified architecture. Each modality is encoded by a dedicated encoder, then projected into a shared representation space. The input sequence splits into two subsequences: an autoregressive (AR) subsequence for reasoning via next-token prediction, and a diffusion (DM) subsequence for generation via iterative denoising. Within each transformer layer, AR and DM tokens are handled by separate parameter sets, but the two streams interact through joint attention.
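
The idea of separate parameter sets that meet in joint attention can be illustrated with a toy layer. This is a minimal NumPy sketch of the general Mixture-of-Transformers pattern, not NVIDIA's implementation: the class name ToyMoTLayer, the single attention head, and the random weights are all assumptions, and attention masking, normalization, and feed-forward blocks are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ToyMoTLayer:
    """Toy single-head layer: per-stream weights, one shared attention pass."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)

        def init():
            return rng.standard_normal((dim, dim)) / np.sqrt(dim)

        # One parameter set per stream: index 0 = AR tokens, index 1 = DM tokens.
        self.wq = [init(), init()]
        self.wk = [init(), init()]
        self.wv = [init(), init()]
        self.wo = [init(), init()]
        self.dim = dim

    def _route(self, x, is_dm, weights):
        # AR tokens use weights[0]; DM tokens use weights[1].
        out = np.empty_like(x)
        out[~is_dm] = x[~is_dm] @ weights[0]
        out[is_dm] = x[is_dm] @ weights[1]
        return out

    def __call__(self, x, is_dm):
        q = self._route(x, is_dm, self.wq)
        k = self._route(x, is_dm, self.wk)
        v = self._route(x, is_dm, self.wv)
        # Joint attention: every token attends over the full mixed sequence.
        scores = q @ k.T / np.sqrt(self.dim)
        mixed = softmax(scores) @ v
        # Residual connection around the stream-specific output projection.
        return x + self._route(mixed, is_dm, self.wo)


# Six tokens of width 8: the first three are AR tokens, the last three DM tokens.
x = np.random.default_rng(1).standard_normal((6, 8))
is_dm = np.array([False, False, False, True, True, True])
layer = ToyMoTLayer(dim=8)
y = layer(x, is_dm)
print(y.shape)
```

In this sketch the only place the two streams exchange information is the shared attention step, which is why changing a DM token also changes the outputs of the AR tokens.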

Capabilities

Cosmos 3 supports multiple input-output combinations in one model:

Text/image/video to video generation
Text/video to text (Vision Language Model)
Action/image/text to video (forward dynamics)
Text/video to action (inverse dynamics)
Image/text to video and action (policy model)
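
The combinations above can be written down as a small lookup table, which makes it easy to see which modes produce a given output. This is only an illustrative data structure: the task keys and helper function are hypothetical names chosen here, and the input sets simply record the modalities listed for each mode without claiming which ones are required together.

```python
# Input and output modalities for each mode listed in the article.
# Keys are hypothetical labels, not identifiers from the Cosmos 3 API.
TASKS = {
    "video_generation": ({"text", "image", "video"}, {"video"}),
    "vision_language": ({"text", "video"}, {"text"}),
    "forward_dynamics": ({"action", "image", "text"}, {"video"}),
    "inverse_dynamics": ({"text", "video"}, {"action"}),
    "policy": ({"image", "text"}, {"video", "action"}),
}


def tasks_producing(output_modality):
    """Return the sorted names of modes whose outputs include the modality."""
    return sorted(
        name for name, (_, outputs) in TASKS.items() if output_modality in outputs
    )


print(tasks_producing("action"))
print(tasks_producing("video"))
```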

NVIDIA claims the model can generate "realistic and physically plausible video worlds" and reason about motion, causality, and spatial relationships.

Diffusers Integration

Cosmos 3 integrates with Hugging Face's Diffusers library through the Cosmos3OmniPipeline class. A text-to-image example requires minimal code:

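import torch
from diffusers import Cosmos3OmniPipeline

# Load the 8B Nano checkpoint in bfloat16 and place it on the GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Illustrative prompt; replace with your own.
prompt = "A robot arm stacking wooden blocks on a table"

# num_frames=1 requests a single 1280x720 frame, i.e. a still image.
result = pipe(prompt=prompt, num_frames=1, height=720, width=1280)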

Training and Data

NVIDIA released post-training scripts on GitHub for fine-tuning Cosmos 3 on custom data. The company also published synthetic data generation (SDG) datasets on Hugging Face, including:

Embodied-Robot-Scenes for robotics
Physical-Interaction-Scenes from Isaac Sim
Spatial-Reasoning data
Digital-Human-Scenes for human motion

Pricing, training data cutoff date, and specific benchmark scores were not disclosed.

What This Means

Cosmos 3's unified architecture addresses a real friction point in physical AI development: managing multiple specialized models. By combining video generation, reasoning, and action prediction in one model, NVIDIA simplifies the pipeline for robotics, autonomous vehicles, and synthetic data generation use cases. The 8B Nano version's ability to run on workstation hardware makes these capabilities accessible beyond data center deployments. However, the lack of benchmark scores or comparisons to competing models makes it difficult to assess performance claims independently.

Source: huggingface.co

