
Researchers achieve 141% improvement in agent training with just 312 human demonstrations

Researchers at GAIR-NLP have published PC Agent-E, an agent training framework that achieves a 141% relative improvement in computer use tasks starting from only 312 human-annotated trajectories. The method uses Claude 3.7 Sonnet to synthesize alternative action decisions, and the resulting model outperforms Claude 3.7 Sonnet by 10% on WindowsAgentArena-V2.

A new research paper from GAIR-NLP introduces PC Agent-E, an efficient agent training framework that significantly reduces the data requirements for developing computer use agents.

Key Results

Starting with only 312 human-annotated computer use trajectories, the team augmented this small dataset by synthesizing diverse alternative action decisions using Claude 3.7 Sonnet. When trained on these enriched trajectories, PC Agent-E achieved a 141% relative improvement on the WindowsAgentArena-V2 benchmark.

More significantly, the model surpassed Claude 3.7 Sonnet by 10% on the same benchmark—a substantial margin given that it was trained on data synthesized from Claude 3.7 Sonnet itself.

Training Methodology

The framework combines two approaches:

  1. Human demonstrations: 312 manually annotated trajectories capturing diverse computer use patterns
  2. AI data synthesis: Automated generation of alternative action decisions using Claude 3.7 Sonnet to create trajectory diversity without manual annotation

This hybrid approach proved more effective than either component alone. The method significantly outperformed direct distillation from Claude 3.7 Sonnet, indicating that the synthetic data augmentation strategy creates qualitatively different learning signals than simple model distillation.
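The synthesis step can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the names `Step`, `augment_trajectory`, and `fake_alternatives` are hypothetical, and a real pipeline would prompt a strong model (such as Claude 3.7 Sonnet) with each state rather than using a stub.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    state: str   # e.g., a screenshot description or accessibility tree
    action: str  # the action taken at this state


def augment_trajectory(
    steps: List[Step],
    propose_alternatives: Callable[[str, int], List[str]],
    k: int = 3,
) -> List[Step]:
    """For each human-annotated step, keep the human decision and add
    up to k synthesized alternative action decisions grounded in the
    same state, multiplying trajectory diversity without new annotation."""
    augmented = []
    for step in steps:
        augmented.append(step)  # keep the human demonstration
        for alt in propose_alternatives(step.state, k):
            if alt != step.action:  # skip duplicates of the human choice
                augmented.append(Step(step.state, alt))
    return augmented


# Stand-in for a call to a capable model; a real implementation would
# send the state to the model and parse its proposed actions.
def fake_alternatives(state: str, k: int) -> List[str]:
    return [f"alt_action_{i} given {state}" for i in range(k)]


traj = [Step("desktop", "open_browser"), Step("browser", "type_url")]
enriched = augment_trajectory(traj, fake_alternatives, k=2)
print(len(enriched))  # 2 human steps + 4 synthesized alternatives = 6
```

With k alternatives per step, each human trajectory yields roughly (k + 1)× as many state-action training pairs, which is one way to read the paper's claim that diversity, not raw demonstration count, drives the gains.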

WindowsAgentArena-V2 Benchmark

The researchers also released WindowsAgentArena-V2, an improved version of the WindowsAgentArena benchmark, which they used to evaluate their model. The benchmark measures computer use agent performance in Windows environments.

Implications

The 141% improvement metric and the ability to exceed Claude 3.7 Sonnet's performance with minimal human data suggest that trajectory diversity—rather than raw quantity of demonstrations—may be the critical factor in agent training. By leveraging a capable model to generate alternative decision paths through the same states, the framework effectively multiplies the information content of limited human annotations.

The authors have released code, data, and models publicly on GitHub at https://github.com/GAIR-NLP/PC-Agent-E.

What This Means

This research addresses a fundamental bottleneck in agent development: the cost of collecting large-scale human demonstrations. By showing that strategic data augmentation can exceed the performance of the augmentation source model itself, it provides a practical pathway for building capable computer use agents with limited human annotation budgets. The method's efficiency could accelerate development of agents for automation tasks, though the 312-trajectory baseline remains relatively small and results are benchmark-specific.