
Go-Browse trains 7B model to beat GPT-4o mini on web navigation tasks

Researchers propose Go-Browse, a method for training web agents through structured exploration that frames data collection as graph search. A 7B parameter language model fine-tuned on 10K trajectories achieves 21.7% success on the WebArena benchmark, outperforming GPT-4o mini by 2.4 percentage points.

March 5, 2026 · 1:25 AM · 2 min read

A new research paper introduces Go-Browse, a method for automatically collecting web agent training data at scale through structured environment exploration. The approach addresses a fundamental limitation in digital agents: their inability to understand unfamiliar web environments and navigate efficiently toward task completion.

Method and Dataset

Go-Browse frames data collection as a graph search problem, enabling information reuse across multiple exploration episodes. This structured approach allows efficient exploration without redundant interactions.
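To make the graph search framing concrete, here is a minimal sketch that treats pages as nodes and links as edges. The toy site map `SITE` and the `explore` function are hypothetical illustrations, not the authors' implementation; the sketch only shows how a shared record of discovered pages lets later episodes skip pages that earlier ones already found.

```python
from collections import deque

# Hypothetical toy site: each page maps to the pages it links to.
SITE = {
    "/": ["/shop", "/forum"],
    "/shop": ["/shop/cart", "/"],
    "/forum": ["/forum/new", "/shop"],
    "/shop/cart": ["/shop"],
    "/forum/new": [],
}

def explore(site, start):
    """Breadth-first exploration that starts one episode per discovered page."""
    frontier = deque([start])  # pages discovered but not yet explored
    discovered = {start}       # shared memory reused across episodes
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)     # one exploration episode starts here
        for link in site[page]:
            if link not in discovered:  # skip pages found by earlier episodes
                discovered.add(link)
                frontier.append(link)
    return order

order = explore(SITE, "/")
print(order)
```

Because `discovered` persists across iterations, no page is explored twice, which is the kind of information reuse the article describes.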

The method was instantiated on the WebArena benchmark, producing:

10,000 successful task-solving trajectories
40,000 total interaction steps
Coverage across 100 different URLs

The dataset captures diverse, realistic web navigation scenarios needed for robust agent training.
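For a rough sense of scale, the averages implied by these figures can be worked out directly. This sketch assumes the 40,000 interaction steps all belong to the 10,000 trajectories and that trajectories are spread evenly across URLs; the article states neither, so treat the results as back-of-envelope estimates.

```python
# Figures as reported in the article.
trajectories = 10_000
interaction_steps = 40_000
urls = 100

# Assumption: every counted step belongs to one of the trajectories.
steps_per_trajectory = interaction_steps / trajectories

# Assumption: trajectories are distributed evenly over URLs.
trajectories_per_url = trajectories / urls

print(f"Average steps per trajectory: {steps_per_trajectory:.1f}")
print(f"Average trajectories per URL: {trajectories_per_url:.0f}")
```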

Performance Results

A 7-billion-parameter language model fine-tuned on the Go-Browse dataset achieved:

21.7% success rate on WebArena
2.4 percentage point improvement over GPT-4o mini
2.9 percentage point improvement over current state-of-the-art results for sub-10B parameter models

This is the strongest reported performance to date for language models under 10 billion parameters on WebArena, suggesting that structured training data collection lets a small model compete with proprietary models on specific benchmarks.
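The baseline scores implied by these margins can be recovered with simple subtraction. The values below are derived from the figures in this article rather than quoted from the paper, so they are estimates that inherit any rounding in the reported numbers.

```python
# Reported in the article (percent success on WebArena).
go_browse_success = 21.7
margin_over_gpt4o_mini = 2.4     # percentage points
margin_over_prior_sub10b = 2.9   # percentage points

# Derived values, rounded to one decimal to match the reported precision.
implied_gpt4o_mini = round(go_browse_success - margin_over_gpt4o_mini, 1)
implied_prior_sub10b = round(go_browse_success - margin_over_prior_sub10b, 1)
failure_rate = round(100 - go_browse_success, 1)

print(f"Implied GPT-4o mini success: {implied_gpt4o_mini}%")
print(f"Implied prior sub-10B best:  {implied_prior_sub10b}%")
print(f"Go-Browse failure rate:      {failure_rate}%")
```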

Technical Significance

The core innovation lies in treating data collection as a graph search problem rather than random interaction. This approach allows:

Reuse of exploration knowledge across episodes, reducing redundant interactions
Systematic coverage of web environment states
Scalable data generation without a proportional increase in computational cost
Realistic trajectories that reflect actual web navigation patterns

The structured exploration framework differs from naive agent rollouts, which often result in inefficient, repetitive interactions and limited coverage of diverse web scenarios.
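The contrast with naive rollouts can be illustrated on a toy example. In this sketch, the site map, the fixed-length random walk standing in for a naive rollout, and both function names are hypothetical; it is not the paper's experimental setup, only a way to see why restarting every episode from the home page produces repeated visits.

```python
import random

# Hypothetical toy site: each page maps to the pages it links to.
SITE = {
    "/": ["/shop", "/forum"],
    "/shop": ["/shop/cart", "/"],
    "/forum": ["/forum/new", "/shop"],
    "/shop/cart": ["/shop"],
    "/forum/new": [],
}

def naive_rollouts(site, start, episodes, steps, seed=0):
    """Random walks that all restart from the same page and share no memory."""
    rng = random.Random(seed)
    visits = []
    for _ in range(episodes):
        page = start
        for _ in range(steps):
            visits.append(page)
            links = site[page]
            if not links:       # dead end: the episode stops early
                break
            page = rng.choice(links)
    return visits

def structured_exploration(site, start):
    """Graph search with a shared 'seen' set, so each page is visited once."""
    frontier, seen, visits = [start], {start}, []
    while frontier:
        page = frontier.pop()
        visits.append(page)
        for link in site[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visits

naive = naive_rollouts(SITE, "/", episodes=5, steps=3)
structured = structured_exploration(SITE, "/")

print("naive:      visits =", len(naive), "unique pages =", len(set(naive)))
print("structured: visits =", len(structured), "unique pages =", len(set(structured)))
```

On this toy graph the structured search covers every reachable page with exactly one visit each, while the naive rollouts revisit the home page in every episode.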

What This Means

Go-Browse demonstrates that specialized training data collection methods can meaningfully improve web agent performance without requiring larger model sizes. For practitioners building web agents, this suggests that data collection strategy, not just model scale, is critical for benchmark performance.

The 2.4 percentage point improvement over GPT-4o mini is notable because it shows that an open-weight 7B model, when fine-tuned on well-collected data, can outperform a proprietary model on this task. However, the 21.7% absolute success rate indicates web navigation remains a challenging problem; agents still fail on roughly 4 of 5 WebArena tasks.

The research opens questions about data efficiency: whether similar gains could be achieved with smaller datasets, and how well the structured exploration approach generalizes beyond WebArena to real-world web environments outside the benchmark's controlled 100-URL scope.

ArXiv paper: 2506.03533v2

Source: arxiv.org
