LLM News

Every LLM release, update, and milestone.

Filtered by: web-agents

benchmark

WebDS benchmark reveals 70-75 point performance gap between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark comprising 870 tasks across 29 websites. Current state-of-the-art LLM agents achieve only 15-20% success rates on these complex, multi-step data acquisition and analysis tasks, while humans reach approximately 90% accuracy, revealing significant gaps in agent capabilities.

March 5, 2026 · 5:38 AM · 2 min read

benchmark data-science web-agents

via arxiv.org ↗

benchmark

WebDS benchmark reveals 75-point performance gap between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark containing 870 tasks across 29 websites requiring agents to acquire, clean, and analyze multimodal data from the internet. Current state-of-the-art LLM agents achieve only 15% success on WebDS tasks despite reaching 80% on simpler web benchmarks, while humans achieve 90% accuracy.

March 5, 2026 · 5:08 AM · 2 min read

benchmark data-science web-agents

via arxiv.org ↗

research OpenAI

Go-Browse trains 7B model to beat GPT-4o mini on web navigation tasks

Researchers propose Go-Browse, a method for training web agents through structured exploration that frames data collection as graph search. A 7B parameter language model fine-tuned on 10K trajectories achieves 21.7% success on the WebArena benchmark, outperforming GPT-4o mini by 2.4 percentage points.

March 5, 2026 · 1:25 AM · 2 min read

web-agents language-models training-data

via arxiv.org ↗

research

Researchers model human intervention patterns to build more collaborative web agents

A new research paper introduces methods for predicting when humans will intervene in an autonomous web agent's actions by analyzing distinct interaction patterns. The work, which includes a dataset of 400 real-user web navigation trajectories with over 4,200 interleaved human-agent actions, shows that intervention-aware models improved agent usefulness by 26.5% in user studies.

February 20, 2026 · 3:22 AM · 2 min read

web-agents human-ai-collaboration intervention-modeling

via arxiv.org ↗