NVIDIA Releases Cosmos 3: 8B and 32B Omni-Models Combining Video Generation, Reasoning, and Action in Single Architecture

TL;DR

NVIDIA has released Cosmos 3, a unified omni-model that combines world generation, physical reasoning, and action generation in a single architecture. Available in 8B (Nano) and 32B (Super) parameter versions on Hugging Face, Cosmos 3 uses a Mixture-of-Transformers architecture to process text, image, video, audio, and action modalities without switching between separate models.

NVIDIA has released Cosmos 3, a unified omni-model that eliminates the need for multiple specialized models in physical AI applications. The model combines video generation, physical reasoning, and action prediction capabilities in a single architecture, available immediately on Hugging Face.

Two Model Sizes

Cosmos 3 ships in two versions:

  • Cosmos 3 Nano: 8B parameters (8B reasoner + 8B generator), optimized for workstation-grade hardware like RTX PRO 6000 GPUs
  • Cosmos 3 Super: 32B parameters (32B reasoner + 32B generator), designed for large-scale synthetic data generation on NVIDIA Hopper and Blackwell GPUs

Both models are available on Hugging Face at nvidia/Cosmos3-Nano and nvidia/Cosmos3-Super.
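
As a rough guide to what these sizes mean for hardware, the weight footprint can be estimated as parameter count times bytes per parameter. The sketch below assumes bfloat16 weights (2 bytes each) and ignores activations, caches, and framework overhead. The listing above does not say whether the headline figure counts the reasoner alone or the reasoner and generator together, so both readings are shown; the function and variable names are illustrative only.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: bfloat16 weights (2 bytes each). Activations, caches and
# framework overhead are NOT included.
BYTES_PER_PARAM_BF16 = 2


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Each size is listed as "N reasoner + N generator". Whether the headline
# number counts one component or both is not stated, so print both readings.
for name, component_params in [("Nano", 8e9), ("Super", 32e9)]:
    one_component = weight_memory_gb(component_params)
    both_components = weight_memory_gb(2 * component_params)
    print(f"{name}: {one_component:.0f} GB (one component) / "
          f"{both_components:.0f} GB (both components)")
```

These numbers are a lower bound on memory use; actual requirements depend on resolution, frame count, and the inference stack.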

Unified Architecture

Previous Cosmos releases required developers to use separate models for different tasks: Cosmos Predict for world generation, Cosmos Transfer for controlled generation, Cosmos Reason for scene understanding, and Cosmos Policy for action generation. Cosmos 3 replaces this fragmented approach with a single Mixture-of-Transformers (MoT) architecture.

The model processes every modality (text, image, video, audio, and action) within one architecture. Each modality is encoded by a dedicated encoder, then projected into a shared representation space. The input sequence splits into two subsequences: an autoregressive (AR) subsequence for reasoning via next-token prediction, and a diffusion (DM) subsequence for generation via iterative denoising. Within each transformer layer, separate parameter sets handle the AR and DM tokens, while joint attention lets the two subsequences exchange information.
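
The idea of per-stream parameters with joint attention can be illustrated with a toy, dependency-free sketch. This is not NVIDIA's implementation: the dimensions, random weights, single attention head, and absence of any attention mask are all assumptions made for illustration, and every name here is hypothetical. It shows only the routing pattern: each token is projected with the weights of its own stream (AR or DM), then attends over all tokens from both streams.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_stream_params(rng, dim):
    """Random query/key/value matrices for one stream (toy stand-in for trained weights)."""
    def random_matrix():
        return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    return {name: random_matrix() for name in ("q", "k", "v")}


def mot_attention(tokens, streams, params):
    """Toy single-head attention with per-stream weights and joint attention.

    tokens:  list of vectors, all of the same dimension
    streams: "AR" or "DM" for each token
    params:  {"AR": {...}, "DM": {...}} as built by make_stream_params
    """
    dim = len(tokens[0])
    # Separate parameter sets: each token uses the weights of its own stream.
    queries = [matvec(params[s]["q"], t) for t, s in zip(tokens, streams)]
    keys = [matvec(params[s]["k"], t) for t, s in zip(tokens, streams)]
    values = [matvec(params[s]["v"], t) for t, s in zip(tokens, streams)]
    outputs = []
    for q in queries:
        # Joint attention: score against ALL tokens, regardless of stream.
        # Assumption: no causal or block mask is applied in this sketch.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(dim)])
    return outputs


rng = random.Random(0)
dim = 4
params = {"AR": make_stream_params(rng, dim), "DM": make_stream_params(rng, dim)}
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(5)]
streams = ["AR", "AR", "AR", "DM", "DM"]  # three reasoning tokens, two generation tokens
mixed = mot_attention(tokens, streams, params)
print(len(mixed), "output vectors of dimension", len(mixed[0]))
```

Because attention is computed over the full sequence, changing a DM token changes the outputs at AR positions and vice versa, even though the two streams never share projection weights.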

Capabilities

Cosmos 3 supports multiple input-output combinations in one model:

  • Text/image/video to video generation
  • Text/video to text (Vision Language Model)
  • Action/image/text to video (forward dynamics)
  • Text/video to action (inverse dynamics)
  • Image/text to video and action (policy model)
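
One way to keep these combinations straight in application code is a small lookup table of accepted input and output modalities per task. The sketch below is purely illustrative: the task names and the table structure are hypothetical labels for the list above, not part of any Cosmos 3 API, and it treats the slash-separated inputs as the set of modalities a task may accept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    inputs: frozenset   # modalities the task may take as conditioning
    outputs: frozenset  # modalities the task produces


# Hypothetical task names mapped to the combinations listed in the article.
TASKS = {
    "video_generation": Task(frozenset({"text", "image", "video"}), frozenset({"video"})),
    "vision_language": Task(frozenset({"text", "video"}), frozenset({"text"})),
    "forward_dynamics": Task(frozenset({"action", "image", "text"}), frozenset({"video"})),
    "inverse_dynamics": Task(frozenset({"text", "video"}), frozenset({"action"})),
    "policy": Task(frozenset({"image", "text"}), frozenset({"video", "action"})),
}


def tasks_producing(modality):
    """Return the sorted names of tasks whose outputs include the given modality."""
    return sorted(name for name, task in TASKS.items() if modality in task.outputs)


print(tasks_producing("action"))  # ['inverse_dynamics', 'policy']
print(tasks_producing("video"))   # ['forward_dynamics', 'policy', 'video_generation']
```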

NVIDIA claims the model can generate "realistic and physically plausible video worlds" and reason about motion, causality, and spatial relationships.

Diffusers Integration

Cosmos 3 integrates with Hugging Face's Diffusers library through the Cosmos3OmniPipeline class. A text-to-image example requires minimal code:

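import torch
from diffusers import Cosmos3OmniPipeline

# Load the 8B Nano checkpoint in bfloat16 and place it on the GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Example prompt (placeholder; substitute your own).
prompt = "A robot arm stacking wooden blocks on a workbench"

# num_frames=1 requests a single frame, i.e. a still image.
result = pipe(prompt=prompt, num_frames=1, height=720, width=1280)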

Training and Data

NVIDIA released post-training scripts on GitHub for fine-tuning Cosmos 3 on custom data. The company also published synthetic data generation (SDG) datasets on Hugging Face, including:

  • Embodied-Robot-Scenes for robotics
  • Physical-Interaction-Scenes from Isaac Sim
  • Spatial-Reasoning data
  • Digital-Human-Scenes for human motion

Pricing, training data cutoff date, and specific benchmark scores were not disclosed.

What This Means

Cosmos 3's unified architecture addresses a real friction point in physical AI development: managing multiple specialized models. By combining video generation, reasoning, and action prediction in one model, NVIDIA simplifies the pipeline for robotics, autonomous vehicles, and synthetic data generation use cases. The 8B Nano version's ability to run on workstation hardware makes these capabilities accessible beyond data center deployments. However, the lack of benchmark scores or comparisons to competing models makes it difficult to assess performance claims independently.
