
NVIDIA releases LoRA/DoRA fine-tuning guide for Cosmos Predict 2.5 to generate synthetic robot training data

TL;DR

NVIDIA published a technical guide for parameter-efficient fine-tuning of its Cosmos Predict 2.5 world model using LoRA and DoRA adapters. The method allows teams to adapt the 2B-parameter model to robot manipulation tasks on a single 80GB GPU, generating synthetic training trajectories from just 92 demonstration videos.


Technical implementation

Cosmos Predict 2.5 is a video generation model that produces physically plausible videos from text, images, or video clips. The fine-tuning approach injects small trainable adapter modules into the frozen base model, which lowers compute cost and reduces the risk of catastrophic forgetting compared with full fine-tuning.

The pipeline has three components: a VAE that encodes videos to latents, a text encoder, and a DiT (Diffusion Transformer) that performs latent-space diffusion. All base weights remain frozen. LoRA adapters are injected only into the DiT, specifically its attention projections (to_q, to_k, to_v, to_out.0) and feedforward layers (ff.net.0.proj, ff.net.2). Trainable LoRA parameters are upcast to float32 for numerical stability under bf16 mixed precision.
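To make the adapter targeting concrete, here is a minimal sketch of selecting modules by name suffix. The suffixes come from the guide as reported above. The helper `is_lora_target` and the example module names are hypothetical, written only for illustration, and the suffix rule is an assumption about how names are matched rather than a description of NVIDIA's script.

```python
# Module-name suffixes reported as LoRA targets inside the DiT.
TARGET_SUFFIXES = (
    "to_q", "to_k", "to_v", "to_out.0",  # attention projections
    "ff.net.0.proj", "ff.net.2",         # feedforward layers
)


def is_lora_target(module_name: str) -> bool:
    """Return True if the dotted module name ends with a target suffix."""
    return any(
        module_name == suffix or module_name.endswith("." + suffix)
        for suffix in TARGET_SUFFIXES
    )


# Hypothetical module names, for illustration only.
example_names = [
    "transformer_blocks.0.attn1.to_q",
    "transformer_blocks.0.attn1.to_out.0",
    "transformer_blocks.0.ff.net.0.proj",
    "transformer_blocks.0.ff.net.2",
    "transformer_blocks.0.norm1",
    "vae.encoder.conv_in",
]

# Only the attention and feedforward projections are selected;
# normalization layers and the VAE stay untouched.
selected = [name for name in example_names if is_lora_target(name)]
```

Everything not selected, including the VAE and text encoder, would stay frozen and carry no adapter.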

Training approach

The model is trained with rectified flow: it learns to predict the velocity that linearly transports noise to clean data. At timestep t, the training step builds a noisy interpolation x_t = σ_t·noise + (1 − σ_t)·clean, and the model is trained with an MSE loss to predict the target velocity, noise − clean. The first two video frames serve as conditioning and receive no noise.
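The interpolation and velocity target above can be sketched in plain Python. This is an illustration, not the guide's code: a real training step would operate on latent tensors, the function names are hypothetical, and excluding the conditioning frames from the loss is an assumption that the article does not confirm.

```python
import random


def rectified_flow_example(clean, sigma, num_cond_frames=2, rng=None):
    """Build the noisy input and velocity target for one video.

    clean: list of frames, each a list of floats (a stand-in for latents).
    sigma: noise level sigma_t in [0, 1].
    Returns (x_t, target, mask); mask[i] is False for conditioning frames.
    """
    rng = rng or random.Random()
    x_t, target, mask = [], [], []
    for i, frame in enumerate(clean):
        noise = [rng.gauss(0.0, 1.0) for _ in frame]
        if i < num_cond_frames:
            # Conditioning frames receive no noise and are kept clean.
            x_t.append(list(frame))
            mask.append(False)
        else:
            # x_t = sigma_t * noise + (1 - sigma_t) * clean
            x_t.append([sigma * n + (1.0 - sigma) * c
                        for n, c in zip(noise, frame)])
            mask.append(True)
        # Target velocity: noise - clean
        target.append([n - c for n, c in zip(noise, frame)])
    return x_t, target, mask


def masked_mse(pred, target, mask):
    """MSE over frames where mask is True (masking is an assumption)."""
    squared_error, count = 0.0, 0
    for pred_frame, target_frame, keep in zip(pred, target, mask):
        if keep:
            for p, t in zip(pred_frame, target_frame):
                squared_error += (p - t) ** 2
                count += 1
    return squared_error / count if count else 0.0


clean = [[0.5, -0.5] for _ in range(4)]
sigma = 0.3
x_t, target, mask = rectified_flow_example(clean, sigma, rng=random.Random(0))
perfect_loss = masked_mse(target, target, mask)  # a perfect prediction
```

For the noised frames, x_t − σ_t·(noise − clean) recovers the clean frame, which is why predicting the velocity is enough to step back toward the data.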

NVIDIA's reference training run uses 92 robot manipulation videos paired with text prompts describing pick-and-place tasks, and evaluates on 50 (prompt, image) pairs. As temporal augmentation, the VideoDataset loader samples a random contiguous window of frames from each longer video every epoch.
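The window sampling can be illustrated with a short sketch. The function `sample_window` and its parameters are hypothetical and do not reproduce the guide's VideoDataset class. The sketch only shows how a random contiguous span of frame indices might be drawn.

```python
import random


def sample_window(num_video_frames, window_len, rng=None):
    """Return indices of a random contiguous window of frames."""
    rng = rng or random.Random()
    if window_len > num_video_frames:
        raise ValueError("video is shorter than the requested window")
    # randint is inclusive on both ends, so the last valid start is allowed.
    start = rng.randint(0, num_video_frames - window_len)
    return list(range(start, start + window_len))


# Drawing a fresh window each epoch exposes the model to a different
# temporal crop of the same video.
rng = random.Random(0)
windows = [sample_window(num_video_frames=200, window_len=16, rng=rng)
           for _ in range(3)]
```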

Hardware and configuration

The minimum requirement is a single 80GB GPU; NVIDIA recommends 8× H100s for faster iteration. The guide provides a training script built on the diffusers and accelerate libraries that supports both single- and multi-GPU configurations.

Switching from LoRA to DoRA only requires setting use_dora=True in the LoraConfig; the training loop is otherwise unchanged. DoRA decomposes each weight into a magnitude and a direction before applying the low-rank update.
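For reference, the magnitude/direction decomposition can be written as follows. This follows the general formulation in the DoRA paper and is not taken from NVIDIA's guide. The symbols are assumptions introduced here: W_0 is the frozen weight, B and A are the low-rank factors, m is a trainable magnitude vector, and the norm is taken per column. Any library-specific scaling factor on BA is omitted.

```latex
% LoRA: the low-rank update is added directly to the frozen weight.
W'_{\text{LoRA}} = W_0 + BA

% DoRA: the updated weight is normalized column-wise to give a direction,
% then rescaled by a trainable magnitude vector m.
W'_{\text{DoRA}} = m \,\frac{W_0 + BA}{\lVert W_0 + BA \rVert_c},
\qquad m \text{ initialized to } \lVert W_0 \rVert_c
```

With m initialized to the column norms of W_0 and BA initialized to zero, the adapted weight equals W_0 at the start of training, as it does with LoRA.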

The optimizer is AdamW. The learning rate warms up linearly over scheduler_warm_up_steps to a peak of scheduler_f_max × learning_rate, then decays linearly to scheduler_f_min × learning_rate. Checkpoints are saved as pytorch_lora_weights.safetensors files at a configurable interval of epochs.
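The schedule described above can be sketched as a plain function of the step count. The parameter names mirror those in the article. The function `lr_at`, the `total_steps` argument, the default factor values, and the choice to start warmup from zero are assumptions made for illustration.

```python
def lr_at(step, total_steps, learning_rate, scheduler_warm_up_steps,
          scheduler_f_max=1.0, scheduler_f_min=0.0):
    """Linear warmup to f_max * lr, then linear decay to f_min * lr."""
    peak = scheduler_f_max * learning_rate
    floor = scheduler_f_min * learning_rate
    if scheduler_warm_up_steps > 0 and step < scheduler_warm_up_steps:
        # Warmup: ramp from 0 (an assumption) up to the peak.
        return peak * step / scheduler_warm_up_steps
    # Decay: interpolate from the peak down to the floor.
    decay_steps = max(total_steps - scheduler_warm_up_steps, 1)
    progress = min((step - scheduler_warm_up_steps) / decay_steps, 1.0)
    return peak + (floor - peak) * progress


# Example values are arbitrary and chosen only to show the shape.
schedule = [
    lr_at(s, total_steps=100, learning_rate=1e-4,
          scheduler_warm_up_steps=10, scheduler_f_min=0.1)
    for s in range(101)
]
```

In a PyTorch training loop, a function of this shape could be wrapped in a LambdaLR scheduler by returning the multiplier rather than the absolute learning rate.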

What this means

Parameter-efficient fine-tuning addresses a critical bottleneck in robot learning: collecting real-world demonstration data is slow and expensive. This guide provides a practical path to generate synthetic training data by adapting a general-purpose world model to specific robotic tasks with modest compute requirements. The approach keeps adapter files small and portable, enabling teams to swap different domain adapters at inference time without maintaining separate full model copies. The 92-video training set demonstrates that meaningful domain adaptation is possible with limited data when starting from a capable foundation model.
