
WebDS benchmark reveals a roughly 75-point performance gap between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark comprising 870 tasks across 29 websites. Current state-of-the-art LLM agents achieve only 15-20% success rates on these complex, multi-step data acquisition and analysis tasks, while humans reach approximately 90% accuracy, revealing significant gaps in agent capabilities.


WebDS Benchmark Reveals Massive Performance Gap in AI Agents on Real-World Data Science

Researchers have released WebDS, a new benchmark that exposes critical limitations in current AI agents when tackling realistic data science workflows. The benchmark comprises 870 web-based data science tasks spanning 29 diverse websites, from government data portals to news media platforms.

The Core Problem

Existing benchmarks fall into two categories: web-based benchmarks that test simple interactions without comprehensive tool use, and data science benchmarks that rely on static, structured datasets disconnected from real-world data acquisition challenges. WebDS bridges this gap by requiring agents to perform complex, end-to-end workflows that mirror actual data science work: finding internet-based data sources, synthesizing multimodal information from multiple locations, and generating analytical insights.

Performance Results

The benchmark reveals a stark performance gap:

  • Browser Use (state-of-the-art agent): completes only 15% of WebDS tasks, despite succeeding on 80% of WebVoyager tasks
  • Human baseline: ~90% accuracy
  • Performance gap: Approximately 75 percentage points between current agents and human performance

The dramatic drop from WebVoyager to WebDS performance indicates that existing benchmarks are not capturing the full complexity of real-world data science work.

Why Agents Fail

Researchers identified specific failure modes in current LLM agents attempting WebDS tasks:

  1. Poor information grounding: Agents struggle to properly anchor decisions in retrieved data
  2. Repetitive behavior: Agents get stuck in loops, repeating failed actions
  3. Shortcut-taking: Agents attempt to bypass actual task requirements rather than complete full workflows
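The repetitive-behavior failure mode above is the easiest of the three to spot automatically. As an illustration (not WebDS's actual evaluation code), a simple heuristic is to scan an agent's action trace for recurring action n-grams; the function name, trace format, and thresholds here are all hypothetical:

```python
from collections import Counter

def detect_repetition(actions, window=3, threshold=2):
    """Flag action n-grams that recur in a trajectory -- a crude proxy
    for the looping failure mode (hypothetical heuristic, not WebDS code)."""
    grams = [tuple(actions[i:i + window]) for i in range(len(actions) - window + 1)]
    counts = Counter(grams)
    return [gram for gram, count in counts.items() if count >= threshold]

# A toy trace of an agent stuck re-running the same search.
trace = ["click:search", "type:query", "click:search",
         "type:query", "click:search", "type:query"]
loops = detect_repetition(trace)
print(f"repeated action patterns: {len(loops)}")
```

On this toy trace, both 3-grams recur, so the heuristic flags the trajectory as looping. A real evaluator would need to distinguish legitimate retries (e.g., pagination) from genuine stalls.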

These failures don't appear as prominently in simpler web benchmarks, suggesting that task complexity is essential for identifying practical limitations.

Benchmark Composition

WebDS covers diverse data science scenarios across heterogeneous data formats. Tasks require agents to:

  • Locate and extract relevant data from multiple web sources
  • Combine data from different formats and structures
  • Perform data cleaning and transformation
  • Generate analytical summaries and insights
  • Navigate complex website interfaces and information architecture
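The pipeline those bullets describe can be sketched in a few lines. This is an illustrative stand-in, not a WebDS task: the two inline "sources" (a CSV export and a JSON feed with mismatched schemas) are invented, and a real task would begin by locating and fetching them from live websites:

```python
import csv
import io
import json

# Source A: CSV export from a hypothetical government data portal.
# Note the missing value for "East".
csv_text = """region,unemployment_rate
North,4.2
South,5.1
East,
West,3.8
"""

# Source B: JSON feed from a hypothetical news-media API, different schema.
json_text = ('[{"region": "North", "population": 1200000},'
             ' {"region": "South", "population": 950000},'
             ' {"region": "West", "population": 800000}]')

def load_rates(text):
    # Cleaning step: drop rows with missing rates, coerce strings to floats.
    rows = csv.DictReader(io.StringIO(text))
    return {r["region"]: float(r["unemployment_rate"])
            for r in rows if r["unemployment_rate"]}

def load_populations(text):
    return {r["region"]: r["population"] for r in json.loads(text)}

rates = load_rates(csv_text)
pops = load_populations(json_text)

# Combining step: keep only regions present in both sources.
merged = {k: (rates[k], pops[k]) for k in rates.keys() & pops.keys()}

# Analysis step: population-weighted mean unemployment rate.
weighted = (sum(rate * pop for rate, pop in merged.values())
            / sum(pop for _, pop in merged.values()))
print(f"regions merged: {sorted(merged)}")
print(f"weighted unemployment rate: {weighted:.2f}%")
```

Even this toy version exercises extraction, schema reconciliation, cleaning, and summarization; WebDS additionally requires the agent to find the sources and navigate the sites hosting them, which is where current agents break down.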

The inclusion of 29 different websites ensures that agents cannot exploit domain-specific shortcuts or memorized interaction patterns.

What This Means

WebDS establishes a more rigorous evaluation standard for LLM-based data science agents. The benchmark's results indicate that current agents are fundamentally unprepared for the real-world complexity of data acquisition and analysis workflows. For researchers developing autonomous agents, this benchmark provides concrete evidence that improvements beyond current SOTA approaches are necessary.

The 75-point gap to human performance suggests substantial room for advancement in agent reasoning, information processing, and task persistence. Organizations evaluating agent capabilities for data science applications should expect significant limitations with current technology.