
NVIDIA Releases Cosmos3-Super-Text2Image: 64B Parameter Model for Physical AI Applications

TL;DR

NVIDIA released Cosmos3-Super-Text2Image, a 64-billion parameter text-to-image generation model as part of its Cosmos3 collection of omnimodal world models. The model uses a Mixture-of-Transformers architecture combining autoregressive and diffusion transformers, designed for Physical AI applications including robotics and autonomous vehicles.

June 2, 2026 · 5:51 PM · 2 min read

Cosmos3-Super-Text2Image — Quick Specs

Context window: 4K tokens



NVIDIA released Cosmos3-Super-Text2Image, a 64-billion parameter text-to-image generation model as part of its Cosmos3 collection of omnimodal world models. The model is available now on Hugging Face and GitHub under the OpenMDW1.1 license for commercial and non-commercial use.

Model Architecture and Specifications

Cosmos3-Super-Text2Image is built on a Mixture-of-Transformers (MoT) architecture consisting of two complementary transformer towers: an autoregressive transformer for discrete token generation and a diffusion transformer for continuous multimodal generation. The model generates high-fidelity JPEG images from text descriptions at resolutions including 256p, 480p, and 720p across multiple aspect ratios (16:9, 4:3, 1:1, 3:4, 9:16).
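To make the resolution and aspect-ratio combinations concrete, the following is a minimal sketch of how pixel dimensions could be derived from them. It rests on two assumptions that the article does not confirm: that the "p" figure refers to the shorter image side, and that each side is rounded to a multiple of 16 pixels. The function name `image_size` and the `multiple` parameter are hypothetical, not part of any NVIDIA API.

```python
from fractions import Fraction

# Values taken from the article.
RESOLUTIONS = {"256p": 256, "480p": 480, "720p": 720}
ASPECT_RATIOS = ["16:9", "4:3", "1:1", "3:4", "9:16"]


def image_size(resolution: str, aspect_ratio: str, multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) in pixels for a resolution/aspect-ratio pair.

    Assumption: the "p" value is the shorter side, and both sides are
    rounded to the nearest multiple of `multiple` (hypothetical choice).
    """
    short_side = RESOLUTIONS[resolution]
    w_part, h_part = (int(part) for part in aspect_ratio.split(":"))
    ratio = Fraction(w_part, h_part)  # width / height, kept exact

    if ratio >= 1:  # landscape or square: height is the short side
        width, height = short_side * ratio, Fraction(short_side)
    else:  # portrait: width is the short side
        width, height = Fraction(short_side), short_side / ratio

    def snap(value: Fraction) -> int:
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


if __name__ == "__main__":
    for res in RESOLUTIONS:
        for ar in ASPECT_RATIOS:
            print(res, ar, image_size(res, ar))
```

Under these assumptions, 720p at 16:9 maps to 1280 × 720 and 720p at 9:16 maps to 720 × 1280; the model's actual output sizes may differ.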

The model accepts text input of up to 4,096 tokens and outputs two-dimensional RGB images in JPEG format. NVIDIA specifies that only BF16 precision is officially tested and supported.

Broader Cosmos3 Platform

Cosmos3-Super-Text2Image is one of five models in the Cosmos3 collection:

Cosmos3-Nano: 16B parameters for multimodal understanding and generation
Cosmos3-Super: 64B parameters for multimodal understanding and generation
Cosmos3-Nano-Policy-DROID: 16B parameters for robot action trajectory generation
Cosmos3-Super-Image2Video: 64B parameters for video generation from images
Cosmos3-Super-Text2Image: 64B parameters for text-to-image generation
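As a rough illustration of what these parameter counts imply, the sketch below estimates the memory needed just to hold the weights in BF16, which uses 2 bytes per parameter. The parameter counts are the rounded figures from the list above; the function name `weight_footprint_gb` is hypothetical, and the estimate ignores activations, caches, and framework overhead, so real memory use will be higher.

```python
# BF16 stores each parameter in 16 bits, i.e. 2 bytes.
BYTES_PER_PARAM_BF16 = 2

# Rounded parameter counts as listed in the article.
COSMOS3_MODELS = {
    "Cosmos3-Nano": 16_000_000_000,
    "Cosmos3-Super": 64_000_000_000,
    "Cosmos3-Nano-Policy-DROID": 16_000_000_000,
    "Cosmos3-Super-Image2Video": 64_000_000_000,
    "Cosmos3-Super-Text2Image": 64_000_000_000,
}


def weight_footprint_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Weights-only memory estimate in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, params in COSMOS3_MODELS.items():
        print(f"{name}: ~{weight_footprint_gb(params):.0f} GB of weights in BF16")
```

By this estimate, a 64B model needs about 128 GB for weights alone and a 16B model about 32 GB, which suggests the Super variants would typically need multiple GPUs or a single very large-memory accelerator.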

According to NVIDIA, the Cosmos platform is designed to accelerate Physical AI development by enabling machines to understand, simulate, and interact with the physical world across robotics, autonomous driving, and smart space environments.

Technical Requirements

The model requires NVIDIA GPU-accelerated systems running on Ampere, Hopper, or Blackwell microarchitectures. Supported runtime engines include PyTorch, vLLM-Omni, and Hugging Face Diffusers. NVIDIA has tested the model on Linux operating systems only.
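A deployment script might check the GPU generation before loading the model. The sketch below maps a CUDA compute capability to a microarchitecture name and tests it against the three architectures named above. The mapping reflects commonly documented compute capabilities and is an assumption here, not something the article states; the function names are hypothetical.

```python
# Architectures the article lists as supported.
LISTED_ARCHITECTURES = {"Ampere", "Hopper", "Blackwell"}


def gpu_architecture(major: int, minor: int) -> str:
    """Map a CUDA compute capability to a microarchitecture name.

    Assumed mapping: 8.0/8.6/8.7 -> Ampere, 8.9 -> Ada Lovelace,
    9.x -> Hopper, 10.x and 12.x -> Blackwell.
    """
    if major == 8:
        return "Ada Lovelace" if minor == 9 else "Ampere"
    if major == 9:
        return "Hopper"
    if major in (10, 12):
        return "Blackwell"
    return "unknown"


def is_listed_architecture(major: int, minor: int) -> bool:
    """True if the capability belongs to an architecture named in the article."""
    return gpu_architecture(major, minor) in LISTED_ARCHITECTURES


if __name__ == "__main__":
    for capability in [(7, 5), (8, 0), (8, 9), (9, 0), (10, 0)]:
        print(capability, gpu_architecture(*capability), is_listed_architecture(*capability))
```

On a machine with PyTorch and a CUDA device, the `(major, minor)` pair can be obtained from `torch.cuda.get_device_capability()` and passed to `is_listed_architecture`. An architecture missing from the article's list is not necessarily unusable; it is simply not among those NVIDIA is reported to support.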

Pricing information has not been disclosed.

What This Means

Cosmos3 extends NVIDIA's push into foundation models for Physical AI and positions the company against general-purpose multimodal models from OpenAI, Anthropic, and Google. A unified architecture that handles multiple modalities within a single framework could reduce deployment complexity for robotics and autonomous systems developers. However, the 64B parameter count and the requirement for NVIDIA-specific hardware may limit accessibility compared to smaller, hardware-agnostic alternatives. Without published benchmark scores, a performance comparison with existing text-to-image models such as Stable Diffusion 3 or DALL-E 3 is not possible at this stage.

Source: huggingface.co


