WebDS benchmark reveals a 75-percentage-point gap between AI agents and humans on real-world data science tasks
Researchers introduced WebDS, the first end-to-end web-based data science benchmark: 870 tasks across 29 websites that require agents to acquire, clean, and analyze multimodal data from the live web. Current state-of-the-art LLM agents achieve only 15% success on WebDS despite reaching 80% on simpler web benchmarks, while humans achieve roughly 90% accuracy.
WebDS: An End-to-End Benchmark for Web-based Data Science
A new benchmark exposes critical limitations in current AI agents: while leading agents like Browser Use achieve 80% success rates on existing web benchmarks, they complete only 15% of tasks on WebDS, a newly introduced benchmark for real-world data science workflows.
The Problem WebDS Addresses
Existing web benchmarks test simplistic interactions and narrow tool-use capabilities, while traditional data science benchmarks focus on static, pre-cleaned datasets that don't reflect how data scientists actually work. WebDS bridges this gap by testing end-to-end workflows that mirror production data science: finding data sources online, integrating multimodal information from disparate sources, cleaning messy data, and generating insights.
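The acquire-clean-analyze shape of these workflows can be sketched in miniature. The sources, city names, and numbers below are invented for illustration; WebDS tasks use real websites and far messier data:

```python
import statistics

# Hypothetical messy extracts from two web sources: values carry
# thousands separators, blanks, and trailing units.
SOURCE_A = ["Springfield: 30,000", "Shelbyville: n/a", "Ogdenville: 12,500"]
SOURCE_B = ["Capital City: 60000 residents"]

def parse_count(text: str):
    """Extract the numeric count from one messy record, or None if unusable."""
    digits = "".join(ch for ch in text.split(":")[1] if ch.isdigit())
    return int(digits) if digits else None

def pipeline(records):
    """Acquire -> clean -> analyze: the shape of a WebDS-style task."""
    counts = [c for c in (parse_count(r) for r in records) if c is not None]
    return {"n": len(counts), "median": statistics.median(counts)}

result = pipeline(SOURCE_A + SOURCE_B)  # drops the unusable record, keeps 3
```

Even this toy version shows why step quality compounds: a parsing mistake in the cleaning stage silently corrupts every downstream statistic.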
Benchmark Scale and Scope
WebDS comprises 870 tasks distributed across 29 diverse websites, ranging from structured government data portals to unstructured news media sources. The tasks require agents to navigate heterogeneous data formats and perform multi-step operations combining web navigation, data extraction, synthesis, and analysis.
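Aggregating pass/fail results per website is the natural way to report a benchmark with this structure. A minimal scoring sketch, with invented site names and records (the article does not describe the actual WebDS harness format):

```python
from collections import defaultdict

# Hypothetical (site, passed) records from a benchmark run.
results = [
    ("data.gov", True), ("data.gov", False),
    ("nytimes.com", False), ("nytimes.com", False),
]

def success_by_site(records):
    """Return per-website success rates from (site, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # site -> [passed, attempted]
    for site, passed in records:
        totals[site][0] += int(passed)
        totals[site][1] += 1
    return {site: p / t for site, (p, t) in totals.items()}

rates = success_by_site(results)  # {"data.gov": 0.5, "nytimes.com": 0.0}
```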
Performance Gap: AI vs. Human
The results are stark:
- Current AI agents: 15% success rate (tested on Browser Use, a state-of-the-art agent)
- Human baseline: ~90% accuracy
- Performance gap: 75 percentage points
This massive discrepancy persists despite the same agent achieving 80% on WebVoyager, a simpler web interaction benchmark.
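The gap here is absolute (percentage points), not relative; a quick calculation makes the distinction explicit:

```python
agent_pct, human_pct = 15, 90  # success rates reported in the article

gap_points = human_pct - agent_pct          # absolute gap: 75 points
relative = round(agent_pct / human_pct, 2)  # agents solve ~17% of the human rate
```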
Root Causes of Agent Failures
Researchers identified three primary failure modes:
- Poor information grounding — agents fail to accurately extract and track information from web pages
- Repetitive behavior — agents get stuck in loops, re-attempting failed actions without adaptation
- Shortcut-taking — agents pursue incomplete solutions rather than following required workflows to completion
These failures suggest current agents lack robustness in reasoning about complex, real-world data workflows where each step's quality directly impacts downstream analysis.
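The repetitive-behavior mode is mechanically detectable: an agent harness can flag when the same action recurs too often in a sliding window. A hypothetical guard sketch (this is not from the WebDS paper, just an illustration of the failure pattern):

```python
from collections import Counter, deque

class LoopGuard:
    """Flag an agent that re-issues the same action repeatedly
    within a sliding window of recent actions."""

    def __init__(self, window: int = 10, max_repeats: int = 3):
        self.history = deque(maxlen=window)  # recent actions, oldest dropped
        self.max_repeats = max_repeats

    def record(self, action: str) -> bool:
        """Log one action; return True if it now exceeds the repeat limit."""
        self.history.append(action)
        return Counter(self.history)[action] > self.max_repeats

guard = LoopGuard()
# An agent stuck re-clicking the same element trips the guard.
stuck = any(guard.record("click #submit") for _ in range(5))
```

A guard like this can only detect the loop; escaping it still requires the agent to adapt its plan, which is exactly what the benchmark shows current agents fail to do.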
Implications
WebDS reveals that benchmark progress in web interaction hasn't translated to practical data science capabilities. The 15% success rate on authentic tasks suggests current LLM agents are not yet suitable for autonomous data analytics workflows without significant human oversight.
The benchmark represents a meaningful step toward more rigorous evaluation of AI systems in practical domains. By exposing the gap between simplified benchmarks and real-world requirements, WebDS establishes a concrete target for researchers developing the next generation of autonomous AI systems.
What This Means
WebDS demonstrates that strong performance on existing benchmarks masks critical weaknesses in real-world applicability. The 75-percentage-point gap between current agents and human performance on authentic data science tasks indicates substantial work remains before AI can reliably handle end-to-end analytical workflows. For practitioners, this confirms that deploying AI agents for unsupervised data analytics remains premature. For researchers, WebDS provides a concrete, realistic testing ground that will likely drive improvements in agent reasoning, error recovery, and information synthesis over the coming months.