model release

NVIDIA Releases GR00T N1.7, 3B-Parameter Open-Source Humanoid Robot Model Trained on 20,854 Hours of Human Video

TL;DR

NVIDIA released GR00T N1.7, a 3-billion parameter open-source Vision-Language-Action model for humanoid robots with commercial licensing. The model was trained on 20,854 hours of human egocentric video data and demonstrates the first documented scaling law for robot dexterity, where increasing human video data from 1,000 to 20,000 hours more than doubles task completion rates.



NVIDIA released NVIDIA Isaac GR00T N1.7, a 3-billion parameter open-source Vision-Language-Action (VLA) model for humanoid robots. The model is commercially licensed and available now on Hugging Face and GitHub.

Model Architecture and Specifications

GR00T N1.7 uses an Action Cascade architecture with two distinct systems:

  • System 2 (Vision-Language Model): A Cosmos-Reason2-2B backbone processes image tokens and language instructions to produce high-level action tokens for task decomposition and multi-step reasoning
  • System 1 (Diffusion Transformer): A 32-layer DiT converts the VLM's output and live robot state into precise motor commands in real time

The model accepts RGB image frames at any resolution, natural language instructions, and robot proprioceptive state (joint positions, velocities, end-effector poses) as inputs. It outputs continuous-value action vectors mapped to the robot's degrees of freedom.
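The two-system data flow described above can be sketched as a pair of function stubs. This is an illustrative assumption about the interface, not the actual GR00T N1.7 API; the function names, token counts, and a 44-dimensional state vector are all hypothetical placeholders.

```python
import numpy as np

def system2_vlm(image: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for the Cosmos-Reason2-2B backbone: maps an RGB frame and a
    language instruction to a sequence of high-level action tokens."""
    # Placeholder output: 16 action tokens, 512-dim each (shapes assumed).
    return np.zeros((16, 512))

def system1_dit(action_tokens: np.ndarray, robot_state: np.ndarray) -> np.ndarray:
    """Stand-in for the 32-layer diffusion transformer: converts action tokens
    plus live proprioceptive state into a continuous action vector."""
    dof = robot_state.shape[0]  # one command value per degree of freedom
    return np.zeros(dof)

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # RGB image (any resolution)
state = np.zeros(44)                             # joint positions/velocities etc.
tokens = system2_vlm(frame, "pick up the red gear")
action = system1_dit(tokens, state)
print(action.shape)                              # (44,)
```

The key point of the split is that the slow, reasoning-heavy System 2 runs on vision and language, while the fast System 1 closes the loop against live robot state.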

NVIDIA has validated the model across locomotion-manipulation, tabletop manipulation, and dexterous bimanual tasks on Unitree G1, Bimanual Manipulator YAM, and AGIBot Genie 1 platforms.

Training on Human Egocentric Video

The model was pre-trained on 20,854 hours of human egocentric video spanning more than 20 task categories, including manufacturing, retail, healthcare, and home environments. This represents a significant increase from the few thousand hours of robot teleoperation data used to train the previous N1.6 version.

According to NVIDIA, the training data came from sensorized human video with ego cameras, wrist cameras, and hand tracking. The company's research revealed what it describes as the first documented scaling law for robot dexterity: increasing human egocentric data from 1,000 to 20,000 hours more than doubles average task completion rates.

This scaling enables 22-degree-of-freedom hands to perform contact-rich tasks such as small-parts assembly and handling fragile components.
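As a back-of-envelope illustration, suppose the reported scaling follows a power law, completion_rate ∝ hours^α. The exponent below is derived only from the two data points cited (a roughly 2x gain from 1,000 to 20,000 hours); the true functional form is not published here.

```python
import math

h1, h2 = 1_000, 20_000
ratio = 2.0                                  # "more than doubles" -> at least 2x
alpha = math.log(ratio) / math.log(h2 / h1)  # fitted exponent, ~0.231
print(round(alpha, 3))                       # 0.231

# Under this assumed fit, a further 2x gain would need roughly another
# 20x data:
h3 = h2 * (ratio ** (1 / alpha))
print(round(h3))                             # 400000 hours
```

If the power-law assumption holds even approximately, each successive doubling of dexterity gets much more data-hungry, which is why pre-training on abundant human video rather than scarce teleoperation data matters.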

Deployment and Fine-Tuning

The model is commercially licensed and supports NVIDIA Ampere, Hopper, Ada Lovelace, Blackwell, and Jetson platforms. Inference performance at 4 denoising steps with a single camera view is documented in the GitHub repository.
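The 4-denoising-step setting refers to few-step diffusion sampling in the action head. A minimal sketch of such a loop, assuming a standard iterative denoiser (the actual GR00T N1.7 sampler and noise schedule are not specified here; `denoise` is a hypothetical stand-in for the DiT):

```python
import numpy as np

def denoise(noisy_action, conditioning, t):
    # Stand-in: a real DiT would predict a less-noisy action from the noisy
    # input, the VLM conditioning, and the diffusion timestep t.
    return noisy_action * 0.5

def sample_action(conditioning, dof=22, steps=4, seed=0):
    rng = np.random.default_rng(seed)
    action = rng.standard_normal(dof)   # start from Gaussian noise
    for t in reversed(range(steps)):    # 4 denoising steps
        action = denoise(action, conditioning, t)
    return action

action = sample_action(conditioning=None)
print(action.shape)                     # (22,)
```

Fewer denoising steps trades sampling fidelity for latency, which is the relevant axis for real-time motor control on edge hardware.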

GR00T N1.7 supports fine-tuning on custom robot embodiments using the LeRobot dataset format. Pre-registered embodiments include UNITREE_G1, LIBERO_PANDA, and OXE_WIDOWX. The model is a drop-in replacement for N1.6; existing embodiment configurations carry over.
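For orientation, one frame of a LeRobot-style fine-tuning dataset might look like the dictionary below. The `observation.*`/`action` key naming follows the common LeRobot convention, but the exact schema, camera names, and state dimensions for a given embodiment are assumptions; the repository's embodiment configs are authoritative.

```python
import numpy as np

# Hypothetical single frame in a LeRobot-format episode (schema assumed).
frame = {
    "observation.images.ego_view": np.zeros((480, 640, 3), dtype=np.uint8),
    "observation.state": np.zeros(44, dtype=np.float32),  # proprioception
    "action": np.zeros(44, dtype=np.float32),             # target command
    "task": "pick up the red gear",                       # language label
}
print(sorted(frame))
```

Because the pre-registered embodiments already map such keys to the model's inputs and outputs, adapting a new robot mostly means defining an analogous mapping for its cameras and joints.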

NVIDIA states the model is factory-floor ready for production deployments in material handling, packaging, and inspection tasks.

What This Means

GR00T N1.7 represents a shift in robot training methodology from teleoperation-based data collection to human video pre-training. The documented scaling law suggests that robot dexterity can improve predictably with more human video data, potentially reducing the need for expensive robot demonstration data. The commercial licensing and open-source release make the model immediately deployable in production environments, though real-world performance across diverse manufacturing settings remains to be independently verified. The 3B parameter size makes the model computationally feasible for edge deployment on robots while maintaining the reasoning capabilities needed for multi-step tasks.
