model release

NVIDIA Releases GR00T N1.7, 3B-Parameter Open-Source Humanoid Robot Model Trained on 20,854 Hours of Human Video

TL;DR

NVIDIA released GR00T N1.7, a 3-billion parameter open-source Vision-Language-Action model for humanoid robots with commercial licensing. The model was trained on 20,854 hours of human egocentric video data and demonstrates the first documented scaling law for robot dexterity, where increasing human video data from 1,000 to 20,000 hours more than doubles task completion rates.



NVIDIA released NVIDIA Isaac GR00T N1.7, a 3-billion parameter open-source Vision-Language-Action (VLA) model for humanoid robots. The model is commercially licensed and available now on Hugging Face and GitHub.

Model Architecture and Specifications

GR00T N1.7 uses an Action Cascade architecture with two distinct systems:

  • System 2 (Vision-Language Model): A Cosmos-Reason2-2B backbone processes image tokens and language instructions to produce high-level action tokens for task decomposition and multi-step reasoning
  • System 1 (Diffusion Transformer): A 32-layer DiT converts the VLM's output and live robot state into precise motor commands in real time

The model accepts RGB image frames at any resolution, natural language instructions, and robot proprioceptive state (joint positions, velocities, end-effector poses) as inputs. It outputs continuous-value action vectors mapped to the robot's degrees of freedom.
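The two-system data flow described above can be sketched as a pair of function stubs. This is an illustrative assumption about the interface, not the actual GR00T N1.7 API; the function names, token counts, and a 44-dimensional state vector are all hypothetical placeholders.

```python
import numpy as np

def system2_vlm(image: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for the Cosmos-Reason2-2B backbone: maps an RGB frame and a
    language instruction to a sequence of high-level action tokens."""
    # Placeholder output: 16 action tokens, 512-dim each (shapes assumed).
    return np.zeros((16, 512))

def system1_dit(action_tokens: np.ndarray, robot_state: np.ndarray) -> np.ndarray:
    """Stand-in for the 32-layer diffusion transformer: converts action tokens
    plus live proprioceptive state into a continuous action vector."""
    dof = robot_state.shape[0]  # one command value per degree of freedom
    return np.zeros(dof)

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # RGB image (any resolution)
state = np.zeros(44)                             # joint positions/velocities etc.
tokens = system2_vlm(frame, "pick up the red gear")
action = system1_dit(tokens, state)
print(action.shape)                              # (44,)
```

The key point of the split is that the slow, reasoning-heavy System 2 runs on vision and language, while the fast System 1 closes the loop against live robot state.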

NVIDIA has validated the model across locomotion-manipulation, tabletop manipulation, and dexterous bimanual tasks on Unitree G1, Bimanual Manipulator YAM, and AGIBot Genie 1 platforms.

Training on Human Egocentric Video

The model was pre-trained on 20,854 hours of human egocentric video spanning more than 20 task categories, including manufacturing, retail, healthcare, and home environments. This represents a significant increase from the few thousand hours of robot teleoperation data used to train the previous N1.6 version.

According to NVIDIA, the training data came from sensorized human video with ego cameras, wrist cameras, and hand tracking. The company's research revealed what it describes as the first documented scaling law for robot dexterity: increasing human egocentric data from 1,000 to 20,000 hours more than doubles average task completion rates.

This scaling enables 22-degree-of-freedom hands to perform contact-rich tasks such as small-parts assembly and handling fragile components.
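As a back-of-envelope illustration, suppose the reported scaling follows a power law, completion_rate ∝ hours^α. The exponent below is derived only from the two data points cited (a roughly 2x gain from 1,000 to 20,000 hours); the true functional form is not published here.

```python
import math

h1, h2 = 1_000, 20_000
ratio = 2.0                                  # "more than doubles" -> at least 2x
alpha = math.log(ratio) / math.log(h2 / h1)  # fitted exponent, ~0.231
print(round(alpha, 3))                       # 0.231

# Under this assumed fit, a further 2x gain would need roughly another
# 20x data:
h3 = h2 * (ratio ** (1 / alpha))
print(round(h3))                             # 400000 hours
```

If the power-law assumption holds even approximately, each successive doubling of dexterity gets much more data-hungry, which is why pre-training on abundant human video rather than scarce teleoperation data matters.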

Deployment and Fine-Tuning

The model is commercially licensed and supports NVIDIA Ampere, Hopper, Ada Lovelace, Blackwell, and Jetson platforms. Inference performance at 4 denoising steps with a single camera view is documented in the GitHub repository.
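The 4-denoising-step setting refers to few-step diffusion sampling in the action head. A minimal sketch of such a loop, assuming a standard iterative denoiser (the actual GR00T N1.7 sampler and noise schedule are not specified here; `denoise` is a hypothetical stand-in for the DiT):

```python
import numpy as np

def denoise(noisy_action, conditioning, t):
    # Stand-in: a real DiT would predict a less-noisy action from the noisy
    # input, the VLM conditioning, and the diffusion timestep t.
    return noisy_action * 0.5

def sample_action(conditioning, dof=22, steps=4, seed=0):
    rng = np.random.default_rng(seed)
    action = rng.standard_normal(dof)   # start from Gaussian noise
    for t in reversed(range(steps)):    # 4 denoising steps
        action = denoise(action, conditioning, t)
    return action

action = sample_action(conditioning=None)
print(action.shape)                     # (22,)
```

Fewer denoising steps trades sampling fidelity for latency, which is the relevant axis for real-time motor control on edge hardware.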

GR00T N1.7 supports fine-tuning on custom robot embodiments using the LeRobot dataset format. Pre-registered embodiments include UNITREE_G1, LIBERO_PANDA, and OXE_WIDOWX. The model is a drop-in replacement for N1.6; existing embodiment configurations carry over.
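For orientation, one frame of a LeRobot-style fine-tuning dataset might look like the dictionary below. The `observation.*`/`action` key naming follows the common LeRobot convention, but the exact schema, camera names, and state dimensions for a given embodiment are assumptions; the repository's embodiment configs are authoritative.

```python
import numpy as np

# Hypothetical single frame in a LeRobot-format episode (schema assumed).
frame = {
    "observation.images.ego_view": np.zeros((480, 640, 3), dtype=np.uint8),
    "observation.state": np.zeros(44, dtype=np.float32),  # proprioception
    "action": np.zeros(44, dtype=np.float32),             # target command
    "task": "pick up the red gear",                       # language label
}
print(sorted(frame))
```

Because the pre-registered embodiments already map such keys to the model's inputs and outputs, adapting a new robot mostly means defining an analogous mapping for its cameras and joints.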

NVIDIA states the model is factory-floor ready for production deployments in material handling, packaging, and inspection tasks.

What This Means

GR00T N1.7 represents a shift in robot training methodology from teleoperation-based data collection to human video pre-training. The documented scaling law suggests that robot dexterity can improve predictably with more human video data, potentially reducing the need for expensive robot demonstration data. The commercial licensing and open-source release make the model immediately deployable in production environments, though real-world performance across diverse manufacturing settings remains to be independently verified. The 3B parameter size makes the model computationally feasible for edge deployment on robots while maintaining the reasoning capabilities needed for multi-step tasks.
