NVIDIA Releases Cosmos3-Super-Text2Image: 64B Parameter Model for Physical AI Applications

TL;DR

NVIDIA released Cosmos3-Super-Text2Image, a 64-billion-parameter text-to-image generation model, as part of its Cosmos3 collection of omnimodal world models. The model uses a Mixture-of-Transformers architecture that combines autoregressive and diffusion transformers, and it is designed for Physical AI applications including robotics and autonomous vehicles.

Cosmos3-Super-Text2Image is available now on Hugging Face and GitHub under the OpenMDW1.1 license for commercial and non-commercial use.

Model Architecture and Specifications

Cosmos3-Super-Text2Image is built on a Mixture-of-Transformers (MoT) architecture consisting of two complementary transformer towers: an autoregressive transformer for discrete token generation and a diffusion transformer for continuous multimodal generation. The model generates high-fidelity JPEG images from text descriptions at resolutions including 256p, 480p, and 720p across multiple aspect ratios (16:9, 4:3, 1:1, 3:4, 9:16).
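To make the resolution and aspect-ratio grid concrete, here is a minimal Python sketch that enumerates plausible pixel dimensions for each combination. It rests on two assumptions not stated in NVIDIA's description: that the "p" value refers to the shorter side of the image, and that the longer side is rounded to the nearest even number. The names `TIERS`, `ASPECT_RATIOS`, and `dimensions` are hypothetical, and the model's actual output sizes may differ.

```python
from fractions import Fraction

# Resolution tiers and aspect ratios as listed in the article.
TIERS = {"256p": 256, "480p": 480, "720p": 720}
ASPECT_RATIOS = ["16:9", "4:3", "1:1", "3:4", "9:16"]


def dimensions(tier, aspect):
    """Return (width, height) in pixels for a tier and a "W:H" aspect ratio.

    Assumption: the tier value is the SHORTER side, and the longer side is
    rounded to the nearest even integer. This is illustrative only.
    """
    short = TIERS[tier]
    w, h = (int(part) for part in aspect.split(":"))
    ratio = Fraction(max(w, h), min(w, h))  # exact long/short ratio
    long_side = 2 * round(short * ratio / 2)
    # Landscape and square keep the long side as width; portrait swaps.
    return (long_side, short) if w >= h else (short, long_side)


if __name__ == "__main__":
    for tier in TIERS:
        for aspect in ASPECT_RATIOS:
            width, height = dimensions(tier, aspect)
            print(f"{tier:>4} {aspect:>5} -> {width}x{height}")
```

Under these assumptions the grid has 15 combinations (three tiers times five aspect ratios), with 720p at 16:9 coming out as 1280x720 and its portrait counterpart 9:16 as 720x1280.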

The model accepts text input of up to 4,096 tokens and outputs two-dimensional RGB images in JPEG format. NVIDIA specifies that only BF16 precision is officially tested and supported.
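The BF16 requirement gives a rough lower bound on memory. BF16 stores each value in 16 bits, so every parameter occupies two bytes. The sketch below works through that arithmetic for the parameter counts named in this article. It covers weights only: activations, framework overhead, and any auxiliary components are excluded, so real memory use will be higher. The function name `weight_bytes` is hypothetical.

```python
BYTES_PER_PARAM_BF16 = 2  # BF16 is a 16-bit format


def weight_bytes(num_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Bytes needed to hold the weights alone (no activations or overhead)."""
    return num_params * bytes_per_param


if __name__ == "__main__":
    for label, params in [("16B", 16_000_000_000), ("64B", 64_000_000_000)]:
        size = weight_bytes(params)
        print(f"{label}: {size / 1e9:.0f} GB ({size / 2**30:.1f} GiB) in BF16")
```

For a 64B model this is 128 GB of weights, which exceeds the memory of a single current data-center GPU, so a multi-GPU setup is likely needed unless the runtime offloads or shards the weights.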

Broader Cosmos3 Platform

Cosmos3-Super-Text2Image is one of five models in the Cosmos3 collection:

  • Cosmos3-Nano: 16B parameters for multimodal understanding and generation
  • Cosmos3-Super: 64B parameters for multimodal understanding and generation
  • Cosmos3-Nano-Policy-DROID: 16B parameters for robot action trajectory generation
  • Cosmos3-Super-Image2Video: 64B parameters for video generation from images
  • Cosmos3-Super-Text2Image: 64B parameters for text-to-image generation

According to NVIDIA, the Cosmos platform is designed to accelerate Physical AI development by enabling machines to understand, simulate, and interact with the physical world across robotics, autonomous driving, and smart space environments.

Technical Requirements

The model requires NVIDIA GPUs based on the Ampere, Hopper, or Blackwell microarchitectures. Supported runtime engines include PyTorch, vLLM-Omni, and Hugging Face Diffusers. NVIDIA has tested the model on Linux only.
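A pre-flight check against the listed architectures can be sketched from a GPU's CUDA compute capability. The mapping below from capability numbers to architecture names is an assumption based on commonly documented values and may be incomplete for newer parts. The helper names `architecture_name` and `is_supported` are hypothetical and are not part of any NVIDIA tooling.

```python
# Architectures the article lists as supported.
SUPPORTED = {"Ampere", "Hopper", "Blackwell"}


def architecture_name(capability):
    """Map a (major, minor) compute capability to an architecture name.

    Assumed mapping: 8.x is Ampere except 8.9 (Ada Lovelace), 9.x is
    Hopper, and 10.x or 12.x is Blackwell. Anything else is "other".
    """
    major, minor = capability
    if major == 8:
        return "Ada Lovelace" if minor == 9 else "Ampere"
    if major == 9:
        return "Hopper"
    if major in (10, 12):
        return "Blackwell"
    return "other"


def is_supported(capability):
    """True if the capability maps to an architecture named in the article."""
    return architecture_name(capability) in SUPPORTED


if __name__ == "__main__":
    for cap in [(7, 5), (8, 0), (8, 9), (9, 0), (10, 0)]:
        print(cap, architecture_name(cap), is_supported(cap))
```

With PyTorch installed, the tuple can be obtained from `torch.cuda.get_device_capability()`. Ada Lovelace is reported as unsupported here only because the article does not list it, not because NVIDIA is known to exclude it.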

Pricing information has not been disclosed.

What This Means

Cosmos3 is NVIDIA's latest push into foundation models for Physical AI applications, positioning it alongside general-purpose multimodal models from OpenAI, Anthropic, and Google. A unified architecture that handles multiple modalities within a single framework could reduce deployment complexity for robotics and autonomous systems developers. However, the 64B parameter count and the requirement for NVIDIA-specific hardware may limit accessibility compared to smaller, hardware-agnostic alternatives. Without published benchmark scores, a performance comparison with existing text-to-image models such as Stable Diffusion 3 or DALL-E 3 is not possible at this stage, and the absence of pricing information leaves deployment costs unclear.
