NVIDIA releases LoRA/DoRA fine-tuning guide for Cosmos Predict 2.5 to generate synthetic robot training data
NVIDIA published a technical guide for parameter-efficient fine-tuning of its Cosmos Predict 2.5 world model using LoRA and DoRA adapters. The method allows teams to adapt the 2B-parameter model to robot manipulation tasks on a single 80GB GPU, generating synthetic training trajectories from just 92 demonstration videos.
Technical implementation
Cosmos Predict 2.5 is a video generation model that produces physically plausible videos from text, images, or video clips. The fine-tuning approach injects small trainable adapter modules into the frozen base model, avoiding the cost and catastrophic forgetting risks of full fine-tuning.
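To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRA-style linear layer: the frozen weight W is left untouched and a scaled low-rank product B·A is added on top. The toy matrices, the alpha and rank values, and the helper names are illustrative assumptions, not taken from NVIDIA's guide. Zero-initialising B is a common LoRA convention, so the adapted layer starts out identical to the base layer.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(weight, lora_a, lora_b, x, alpha, rank):
    """Frozen base output plus a scaled low-rank correction: W x + (alpha / rank) * B (A x)."""
    base = matvec(weight, x)      # frozen path, never updated during fine-tuning
    down = matvec(lora_a, x)      # project the input down to `rank` dimensions
    up = matvec(lora_b, down)     # project back up to the output size
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, up)]


# Toy 2x3 frozen weight with a rank-1 adapter (all values are made up).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]           # shape (rank, in_features)
B_zero = [[0.0], [0.0]]          # shape (out_features, rank), zero-initialised
B_trained = [[1.0], [0.0]]       # pretend value after some training
x = [1.0, 2.0, 3.0]

y_base = matvec(W, x)
y_init = lora_forward(W, A, B_zero, x, alpha=2.0, rank=1)
y_trained = lora_forward(W, A, B_trained, x, alpha=2.0, rank=1)
```

Only A and B would receive gradients in such a setup, which is why the saved adapter stays small relative to the base model.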
The pipeline has three components: a VAE that encodes videos into latents, a text encoder, and a DiT (Diffusion Transformer) that performs diffusion in latent space. All base weights remain frozen. LoRA adapters are injected only into the DiT's attention projections (to_q, to_k, to_v, to_out.0) and feed-forward layers (ff.net.0.proj, ff.net.2). The trainable LoRA parameters are upcast to float32 for numerical stability under bf16 mixed precision.
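The layer selection described above can be sketched as a suffix match on dotted module names. This is an assumption about how such targeting typically works, not a copy of the guide's code, and the module names in the example list are hypothetical.

```python
TARGET_SUFFIXES = ("to_q", "to_k", "to_v", "to_out.0", "ff.net.0.proj", "ff.net.2")


def is_lora_target(module_name, targets=TARGET_SUFFIXES):
    """True when the dotted module name equals a target or ends with '.<target>'."""
    return any(module_name == t or module_name.endswith("." + t) for t in targets)


# Hypothetical module names, made up for illustration only.
module_names = [
    "transformer_blocks.0.attn1.to_q",
    "transformer_blocks.0.attn1.to_out.0",
    "transformer_blocks.0.attn1.norm_q",
    "transformer_blocks.0.ff.net.0.proj",
    "transformer_blocks.0.ff.net.2",
    "patch_embed.proj",
]
selected = [name for name in module_names if is_lora_target(name)]
```

Matching on the full dotted suffix rather than the last path segment keeps unrelated layers such as the hypothetical patch_embed.proj out of the adapter set.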
Training approach
Training follows the rectified flow formulation: the model learns to predict the velocity that linearly transports noise to clean data. At timestep t, the training step builds a noisy interpolation x_t = σ_t·noise + (1 − σ_t)·clean, and the model learns to predict the target velocity (noise − clean) under an MSE loss. The first two video frames serve as conditioning and receive no noise.
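The interpolation and velocity target can be sketched in a few lines of plain Python. Frames are represented as flat lists of floats for brevity, and the function names are illustrative. Excluding the conditioning frames from the loss is an assumption, since the article only states that those frames receive no noise.

```python
import random


def make_training_pair(clean, noise, sigma, num_cond_frames=2):
    """Return (noisy input, velocity target) for one video given as a list of frames."""
    noisy, target = [], []
    for i, (c_frame, n_frame) in enumerate(zip(clean, noise)):
        # Conditioning frames keep sigma = 0, so they stay clean.
        s = 0.0 if i < num_cond_frames else sigma
        noisy.append([s * n + (1.0 - s) * c for c, n in zip(c_frame, n_frame)])
        target.append([n - c for c, n in zip(c_frame, n_frame)])
    return noisy, target


def masked_mse(pred, target, num_cond_frames=2):
    """MSE over the non-conditioning frames only (an assumption, see the lead-in)."""
    errors = [
        (p - t) ** 2
        for p_frame, t_frame in zip(pred[num_cond_frames:], target[num_cond_frames:])
        for p, t in zip(p_frame, t_frame)
    ]
    return sum(errors) / len(errors)


rng = random.Random(0)
clean = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]   # 4 toy frames
noise = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
sigma = 0.7

noisy, velocity = make_training_pair(clean, noise, sigma)
loss_if_perfect = masked_mse(velocity, velocity)
```

A quick sanity check on the formulation: subtracting sigma times the target velocity from the noisy input recovers the clean frame, which is what makes a single velocity prediction sufficient to step back toward the data.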
NVIDIA's reference training uses 92 robot manipulation videos with text prompts describing pick-and-place tasks, evaluated against 50 (prompt, image) pairs. The VideoDataset loader samples random contiguous windows of frames from longer videos each epoch for temporal augmentation.
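The random-window sampling can be illustrated with a short sketch. The function name and the frame counts below are hypothetical; the actual VideoDataset loader may differ in detail.

```python
import random


def sample_window(num_frames, window, rng):
    """Pick a random contiguous run of `window` frame indices from a longer video."""
    if window > num_frames:
        raise ValueError("window is longer than the video")
    start = rng.randint(0, num_frames - window)   # randint is inclusive on both ends
    return list(range(start, start + window))


rng = random.Random(42)
# Two "epochs" over the same hypothetical 300-frame video.
indices_epoch_1 = sample_window(num_frames=300, window=93, rng=rng)
indices_epoch_2 = sample_window(num_frames=300, window=93, rng=rng)
```

Because the start offset is redrawn on each call, the same source video can contribute a different clip every epoch.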
Hardware and configuration
Single-GPU training requires at least one 80GB GPU; NVIDIA recommends 8× H100s for faster iteration. The guide provides a training script built on the diffusers and accelerate libraries that supports both single- and multi-GPU configurations.
Switching from LoRA to DoRA requires only setting use_dora=True in the LoraConfig. DoRA decomposes each weight into magnitude and direction before applying the low-rank update, with no other training loop changes required.
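The magnitude and direction split can be sketched numerically as follows. This is a pure-Python illustration of the idea rather than the library's implementation; taking the norm per output row is an assumption about the convention, and the matrices are toy values.

```python
import math


def row_norms(matrix):
    """Euclidean norm of each row (one value per output feature)."""
    return [math.sqrt(sum(v * v for v in row)) for row in matrix]


def dora_weight(base, delta, magnitude):
    """Rescale each row of (base + delta) to the learned magnitude for that row."""
    merged = [[b + d for b, d in zip(b_row, d_row)] for b_row, d_row in zip(base, delta)]
    norms = row_norms(merged)
    return [[m * v / n for v in row] for row, m, n in zip(merged, magnitude, norms)]


W0 = [[3.0, 4.0], [0.0, 2.0]]             # frozen base weight (toy values)
magnitude = row_norms(W0)                 # magnitudes start at the base row norms
zero_delta = [[0.0, 0.0], [0.0, 0.0]]     # low-rank update B @ A before training
some_delta = [[0.0, 0.0], [2.0, 0.0]]     # pretend update after some training

W_init = dora_weight(W0, zero_delta, magnitude)
W_adapted = dora_weight(W0, some_delta, magnitude)
```

In this sketch the low-rank update changes only the direction of each row, while its length is controlled separately by the magnitude vector.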
The optimizer is AdamW with a linear warmup over scheduler_warm_up_steps that peaks at scheduler_f_max × learning_rate, followed by a linear decay to scheduler_f_min × learning_rate. Checkpoints are saved as pytorch_lora_weights.safetensors files at a configurable interval of epochs.
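The schedule shape can be sketched as a multiplier on the base learning rate. The total_steps and f_start parameters are hypothetical additions for illustration, and the warmup's starting value is an assumption because the article does not state it.

```python
def lr_factor(step, warm_up_steps, total_steps, f_max, f_min, f_start=0.0):
    """Multiplier on the base learning rate: linear warmup, then linear decay."""
    if step < warm_up_steps:
        # Ramp from f_start (assumed starting point) up to f_max.
        return f_start + (f_max - f_start) * step / warm_up_steps
    decay_steps = max(1, total_steps - warm_up_steps)
    progress = min(1.0, (step - warm_up_steps) / decay_steps)
    return f_max + (f_min - f_max) * progress


base_lr = 1e-4   # hypothetical value, not taken from the guide
schedule = [
    base_lr * lr_factor(step, warm_up_steps=100, total_steps=1000, f_max=1.0, f_min=0.1)
    for step in range(1001)
]
```

In a PyTorch training loop, a function of this shape could be passed to torch.optim.lr_scheduler.LambdaLR, which multiplies the optimizer's base rate by the returned factor at each step.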
What this means
Parameter-efficient fine-tuning addresses a critical bottleneck in robot learning: collecting real-world demonstration data is slow and expensive. This guide provides a practical path to generating synthetic training data by adapting a general-purpose world model to specific robotic tasks with modest compute requirements. The approach keeps adapter files small and portable, so teams can swap domain adapters at inference time without maintaining separate full copies of the model. The 92-video training set suggests that meaningful domain adaptation is possible with limited data when starting from a capable foundation model.