NVIDIA Shows Task-Seeded Synthetic Data Boosts Nemotron-3 Nano by +11.1 on GPQA

TL;DR

NVIDIA reports that task-seeded synthetic Q&A data improves model performance across multiple benchmarks in a 100B-token continuation experiment on Nemotron-3 Nano. The approach raised GPQA by 11.1 points, MMLU-Pro by 1.8, average code by 1.9, and commonsense understanding by 1.6.

June 4, 2026 · 11:35 AM · 2 min read

NVIDIA researchers published results showing that task-seeded synthetic Q&A generation improved Nemotron-3 Nano performance across multiple benchmarks in a 100B-token continuation experiment. According to NVIDIA, the approach delivered gains of 11.1 points on GPQA, 1.8 on MMLU-Pro, 1.9 on average code tasks, and 1.6 on commonsense understanding, while average math performance held stable.

The Approach

The pipeline uses training splits from approximately 70 public task datasets covering roughly 700 subtasks from lm-eval-harness as "capability seeds." NVIDIA emphasized that held-out evaluation and test data were excluded from generation.
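The split restriction can be pictured as an allowlist applied before any generation happens. The sketch below is illustrative only and is not NVIDIA's code: the `split` field, the `"train"` label, and the `select_seed_records` helper are assumptions made for the example.

```python
from typing import Iterable

def select_seed_records(records: Iterable[dict]) -> list[dict]:
    """Keep only records explicitly labeled as training data.

    Assumes each record carries a hypothetical "split" field. Records with
    a missing or different label (validation, test, unknown) are dropped,
    so held-out data never reaches the generator.
    """
    seeds = []
    for record in records:
        split = str(record.get("split", "")).strip().lower()
        if split == "train":
            seeds.append(record)
    return seeds

# Invented records, for illustration only.
records = [
    {"task": "demo_qa", "split": "train", "question": "q1"},
    {"task": "demo_qa", "split": "test", "question": "q2"},
    {"task": "demo_qa", "question": "q3"},  # unlabeled, so excluded
]
seeds = select_seed_records(records)
```

An allowlist is the conservative choice here: a record with no split label is excluded rather than assumed safe.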

The seed pool comprised two groups:

Knowledge-intensive tasks: 39 tasks, approximately 300 subtasks, roughly 3M seed samples covering factual, scientific, multilingual, and domain-specific Q&A
Reasoning-intensive tasks: 34 tasks, approximately 400 subtasks, roughly 1.5M seed samples covering analytical reasoning, logic, math, code, and commonsense reasoning
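As a quick consistency check, the two groups can be summed against the headline figures. The snippet uses only the approximate counts quoted above, so the totals are approximate as well.

```python
# Approximate figures as quoted in the article.
seed_groups = {
    "knowledge": {"tasks": 39, "subtasks": 300, "samples": 3_000_000},
    "reasoning": {"tasks": 34, "subtasks": 400, "samples": 1_500_000},
}

totals = {
    field: sum(group[field] for group in seed_groups.values())
    for field in ("tasks", "subtasks", "samples")
}
```

The sums come to 73 tasks, about 700 subtasks, and about 4.5M seed samples, which matches the "approximately 70" datasets and "roughly 700" subtasks cited for the pipeline.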

The five-stage process:

Collect seed tasks with suitable training splits
Normalize heterogeneous task records into a unified JSONL schema
Generate similar examples that preserve underlying capabilities while changing content
Enrich answers with reasoning, knowledge, or context
Filter through schema checks, format validation, deduplication, and task-specific answer verification
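The normalization and filtering stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not NVIDIA's implementation: the three-field schema, the source field aliases (`prompt`, `target`, and so on), and the exact-match deduplication key are all hypothetical.

```python
import hashlib
import json

# Hypothetical unified schema; the article does not list the real fields.
REQUIRED_FIELDS = ("task", "question", "answer")

def first_present(record: dict, keys: tuple) -> object:
    """Return the first value among keys that is not None, else ""."""
    for key in keys:
        if record.get(key) is not None:
            return record[key]
    return ""

def normalize(record: dict, task: str) -> dict:
    """Stage 2: map a task-specific record onto the unified schema."""
    question = first_present(record, ("question", "prompt", "input"))
    answer = first_present(record, ("answer", "target", "output"))
    return {"task": task, "question": str(question).strip(), "answer": str(answer).strip()}

def passes_schema(example: dict) -> bool:
    """Schema check: every required field is a non-empty string."""
    return all(isinstance(example.get(f), str) and example[f] for f in REQUIRED_FIELDS)

def dedup_key(example: dict) -> str:
    """Exact-match key over lowercased, whitespace-collapsed question text."""
    canonical = " ".join(example["question"].lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def filter_examples(examples: list[dict]) -> list[dict]:
    """Stage 5 (partial): drop schema failures and duplicate questions."""
    seen = set()
    kept = []
    for example in examples:
        if not passes_schema(example):
            continue
        key = dedup_key(example)
        if key in seen:
            continue
        seen.add(key)
        kept.append(example)
    return kept

def to_jsonl(examples: list[dict]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)

# Invented records with differing field names, for illustration only.
raw = [
    ({"prompt": "What is 2 + 2?", "target": 4}, "demo_math"),
    ({"question": "what is  2 + 2?", "answer": "4"}, "demo_math"),  # duplicate
    ({"question": "Empty answer?", "answer": ""}, "demo_qa"),       # fails schema
]
kept = filter_examples([normalize(record, task) for record, task in raw])
```

Only the first record survives in this example. A production pipeline would likely add near-duplicate detection and the task-specific answer verification described above, which this sketch leaves out.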

Technical Details

NVIDIA stores semantic answer text rather than only option labels. For example, the system records "dirt trapped under the fingernails" instead of just "B" to provide clearer training signals.
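A small helper shows what resolving a label to its answer text might look like. The function name and the letter-label convention are assumptions, and every option other than the one quoted by NVIDIA is invented for the example.

```python
import string

def resolve_answer_text(choices: list[str], label: str) -> str:
    """Return the option text that a single-letter label such as "B" points to."""
    letter = label.strip().upper()
    if len(letter) != 1 or letter not in string.ascii_uppercase:
        raise ValueError(f"unsupported label: {label!r}")
    index = string.ascii_uppercase.index(letter)
    if index >= len(choices):
        raise ValueError(f"label {label!r} is out of range for {len(choices)} choices")
    return choices[index]

# Only option B comes from the article; the other options are placeholders.
choices = [
    "placeholder option A",
    "dirt trapped under the fingernails",
    "placeholder option C",
]
answer_text = resolve_answer_text(choices, "B")
```

Storing `answer_text` alongside the label means the training target carries the content of the answer, not just its position in the list.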

Multiple-choice tasks are easier to verify directly, while generation-style tasks require more cautious task-specific handling, according to the researchers.
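The difference between the two cases can be illustrated with a dispatcher that routes each example to a format-specific check. Everything here is hypothetical, including the `format` and `reference` fields and the numeric comparison standing in for one possible task-specific check; NVIDIA does not describe its verifiers at this level of detail.

```python
def verify_multiple_choice(example: dict) -> bool:
    """Direct check: the stored answer text matches exactly one option."""
    return example.get("choices", []).count(example.get("answer")) == 1

def verify_numeric(example: dict, tolerance: float = 1e-9) -> bool:
    """One possible task-specific check: compare answers as numbers."""
    try:
        return abs(float(example["answer"]) - float(example["reference"])) <= tolerance
    except (KeyError, TypeError, ValueError):
        return False

# Hypothetical registry keyed by a "format" field on each example.
VERIFIERS = {
    "multiple_choice": verify_multiple_choice,
    "numeric": verify_numeric,
}

def verify(example: dict) -> bool:
    """Route to a format-specific verifier; reject formats with no verifier."""
    verifier = VERIFIERS.get(example.get("format"))
    return verifier(example) if verifier is not None else False
```

Rejecting examples with no registered verifier reflects the cautious handling the researchers describe: free-form generations are kept only when some task-specific check can vouch for them.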

For Nemotron Ultra and Super pretraining runs, NVIDIA used a license-compatible subset of the generated data suitable for commercial model training.

The Transfer Learning Rationale

NVIDIA frames the approach through transfer learning across task families. The researchers argue that models can learn reusable behaviors from broad seed tasks and apply them to related applications and evaluations.

According to NVIDIA, the pipeline strengthens behaviors that appear across many tasks: identifying information needs, applying domain knowledge, separating plausible alternatives, following response constraints, executing multi-step reasoning, and grounding answers in context.

The researchers cite earlier evidence from Nemotron Nano pretraining, where AGIEval training data improved MMLU-Pro performance, suggesting that structured Q&A data from one task family can improve behavior outside the original task scope.

What This Means

This research demonstrates measurable gains from structured synthetic data generation during pretraining, not just post-training. The +11.1 point GPQA improvement is particularly notable for a 100B-token continuation experiment. The approach addresses a specific data quality problem: models may see abundant raw text during pretraining but still lack explicit examples of how information requests are structured and resolved. NVIDIA's results suggest that task-seeded synthetic data can fill this gap without requiring models to memorize evaluation datasets directly, though the technique requires careful filtering and verification infrastructure.

Source: huggingface.co
