NVIDIA Shows Task-Seeded Synthetic Data Boosts Nemotron-3 Nano by +11.1 on GPQA
NVIDIA demonstrated that task-seeded synthetic Q&A data improves model performance across multiple benchmarks in a 100B-token continuation experiment on Nemotron-3 Nano. The approach raised GPQA scores by 11.1 points, MMLU-Pro by 1.8, average code by 1.9, and commonsense understanding by 1.6.
NVIDIA researchers published results showing that task-seeded synthetic Q&A generation improved Nemotron-3 Nano performance across multiple benchmarks in a 100B-token continuation experiment. According to NVIDIA, the approach delivered gains of +11.1 points on GPQA, +1.8 on MMLU-Pro, +1.9 on average code tasks, and +1.6 on commonsense understanding, while average math performance held stable.
The Approach
The pipeline uses training splits from approximately 70 public task datasets covering roughly 700 subtasks from lm-eval-harness as "capability seeds." NVIDIA emphasized that held-out evaluation and test data were excluded from generation.
The seed pool comprised two groups:
- Knowledge-intensive tasks: 39 tasks, approximately 300 subtasks, roughly 3M seed samples covering factual, scientific, multilingual, and domain-specific Q&A
- Reasoning-intensive tasks: 34 tasks, approximately 400 subtasks, roughly 1.5M seed samples covering analytical reasoning, logic, math, code, and commonsense reasoning
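As a quick consistency check, the two groups can be summed and compared with the headline figures. The snippet below is only an arithmetic sketch: the dictionary layout and variable names are illustrative, and the inputs are the rounded figures quoted above, so the totals are approximate as well.

```python
# Rounded figures as quoted in the article; the structure and names are illustrative.
seed_groups = {
    "knowledge": {"tasks": 39, "subtasks": 300, "samples": 3_000_000},
    "reasoning": {"tasks": 34, "subtasks": 400, "samples": 1_500_000},
}

# Sum each field across the two groups.
totals = {
    field: sum(group[field] for group in seed_groups.values())
    for field in ("tasks", "subtasks", "samples")
}

print(totals)
# {'tasks': 73, 'subtasks': 700, 'samples': 4500000}
```

The 73 tasks and 700 subtasks line up with the "approximately 70" datasets and "roughly 700" subtasks cited for the pipeline, and the two groups together account for roughly 4.5M seed samples.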
The five-stage process:
- Collect seed tasks with suitable training splits
- Normalize heterogeneous task records into a unified JSONL schema
- Generate similar examples that preserve underlying capabilities while changing content
- Enrich answers with reasoning, knowledge, or context
- Filter through schema checks, format validation, deduplication, and task-specific answer verification
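The normalization and filtering stages can be pictured with a small sketch. NVIDIA has not published its schema or filters in this article, so everything below is an assumption made for illustration: the field names (`task`, `question`, `answer`), the source keys being mapped, and the exact-match deduplication rule are all hypothetical.

```python
import hashlib
import json

# Hypothetical unified schema; the article does not specify NVIDIA's actual fields.
REQUIRED_FIELDS = ("task", "question", "answer")


def first_present(record, keys):
    """Return the value of the first key that exists with a non-None value."""
    for key in keys:
        if record.get(key) is not None:
            return record[key]
    return ""


def normalize(record, task):
    """Map a heterogeneous task record onto the hypothetical unified schema."""
    question = first_present(record, ("question", "query", "prompt"))
    answer = first_present(record, ("answer", "target"))
    return {"task": task, "question": str(question).strip(), "answer": str(answer).strip()}


def passes_schema(example):
    """Schema check: every required field is a non-empty string."""
    return all(isinstance(example.get(f), str) and example[f] for f in REQUIRED_FIELDS)


def dedup_key(example):
    """Hash of the question after lowercasing and collapsing whitespace."""
    text = " ".join(example["question"].lower().split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def filter_examples(examples):
    """Keep schema-valid examples and drop exact duplicates of earlier questions."""
    seen = set()
    kept = []
    for example in examples:
        if not passes_schema(example):
            continue
        key = dedup_key(example)
        if key in seen:
            continue
        seen.add(key)
        kept.append(example)
    return kept


def to_jsonl(examples):
    """Serialize examples as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(example, ensure_ascii=False) for example in examples)


raw = [
    ({"question": "What is 2+2?", "answer": "4"}, "arith"),
    ({"query": "what is  2+2?", "target": "4"}, "arith"),  # duplicate after normalization
    ({"prompt": "", "answer": "x"}, "misc"),  # fails the schema check (empty question)
]
kept = filter_examples([normalize(record, task) for record, task in raw])
print(to_jsonl(kept))
```

Only the first record survives here. A production pipeline would likely use near-duplicate detection rather than exact hashing, and would add the task-specific answer verification the article mentions.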
Technical Details
NVIDIA stores semantic answer text rather than only option labels. For example, the system records "dirt trapped under the fingernails" instead of just "B" to provide clearer training signals.
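One way to picture this is a helper that resolves an option label to the text of the choice it points at. This is a minimal sketch, not NVIDIA's code: the function name `resolve_answer_text`, the record layout, and the placeholder distractors are assumptions; only the answer string comes from the article.

```python
from typing import Sequence


def resolve_answer_text(choices: Sequence[str], label: str) -> str:
    """Return the text of the choice that a letter label ("A", "B", ...) refers to."""
    label = label.strip().upper()
    if len(label) != 1:
        raise ValueError(f"expected a single-letter label, got {label!r}")
    index = ord(label) - ord("A")
    if not 0 <= index < len(choices):
        raise ValueError(f"label {label!r} is out of range for {len(choices)} choices")
    return choices[index]


# Hypothetical record; the distractors are placeholders, not real dataset content.
record = {
    "choices": [
        "placeholder option A",
        "dirt trapped under the fingernails",
        "placeholder option C",
        "placeholder option D",
    ],
    "label": "B",
}

record["answer_text"] = resolve_answer_text(record["choices"], record["label"])
print(record["answer_text"])  # dirt trapped under the fingernails
```

Storing `answer_text` alongside the label means the training target carries the content of the answer rather than a position that only makes sense next to the option list.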
Multiple-choice tasks are easier to verify directly, while generation-style tasks require more cautious task-specific handling, according to the researchers.
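The difference in verification effort can be sketched as a dispatcher that checks what it can and declines to judge the rest. The article does not describe NVIDIA's verifiers, so the `format` tags, the field names, and the numeric check below are hypothetical stand-ins for whatever task-specific logic is actually used.

```python
import math


def verify_multiple_choice(example):
    """Direct check: the stored answer text must be one of the listed choices."""
    return example["answer"] in example["choices"]


def verify_numeric(example):
    """Task-specific check: answer and reference must parse to the same number."""
    try:
        got = float(example["answer"])
        want = float(example["reference"])
    except (KeyError, TypeError, ValueError):
        return False
    return math.isclose(got, want, rel_tol=1e-9)


# Hypothetical registry mapping a format tag to its verifier.
VERIFIERS = {
    "multiple_choice": verify_multiple_choice,
    "numeric": verify_numeric,
}


def verify(example):
    """Return True or False when a verifier exists, or None when none applies."""
    verifier = VERIFIERS.get(example.get("format"))
    if verifier is None:
        return None  # no automatic check; needs separate task-specific handling
    return verifier(example)


print(verify({"format": "multiple_choice", "choices": ["a", "b"], "answer": "b"}))  # True
print(verify({"format": "numeric", "answer": "0.50", "reference": "0.5"}))  # True
print(verify({"format": "free_text", "answer": "an open-ended response"}))  # None
```

Returning `None` instead of `True` for unrecognized formats keeps unverified generation-style examples from passing the filter silently.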
For Nemotron Ultra and Super pretraining runs, NVIDIA used a license-compatible subset of the generated data suitable for commercial model training.
The Transfer Learning Rationale
NVIDIA frames the approach through transfer learning across task families. The researchers argue that models can learn reusable behaviors from broad seed tasks and apply them to related applications and evaluations.
According to NVIDIA, the pipeline strengthens behaviors that appear across many tasks: identifying information needs, applying domain knowledge, separating plausible alternatives, following response constraints, executing multi-step reasoning, and grounding answers in context.
The researchers cite earlier evidence from Nemotron Nano pretraining, where AGIEval training data improved MMLU-Pro performance, suggesting that structured Q&A data from one task family can improve behavior outside the original task scope.
What This Means
NVIDIA's reported results point to measurable gains from structured synthetic data generation during pretraining, not just post-training. The +11.1-point GPQA improvement is particularly notable for a 100B-token continuation experiment. The approach addresses a specific data quality problem: models may see abundant raw text during pretraining but still lack explicit examples of how information requests are structured and resolved. The results suggest that task-seeded synthetic data can fill this gap without requiring models to memorize evaluation datasets directly, though the technique requires careful filtering and verification infrastructure.