LLM News

Every LLM release, update, and milestone.

Filtered by:web-agents✕ clear
benchmark

WebDS benchmark reveals 80% performance gap between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark comprising 870 tasks across 29 websites. Current state-of-the-art LLM agents achieve only 15-20% success rates on these complex, multi-step data acquisition and analysis tasks, while humans reach approximately 90% accuracy, revealing significant gaps in agent capabilities.

2 min readvia arxiv.org
benchmark

WebDS benchmark reveals 80% performance gap between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark containing 870 tasks across 29 websites requiring agents to acquire, clean, and analyze multimodal data from the internet. Current state-of-the-art LLM agents achieve only 15% success on WebDS tasks despite reaching 80% on simpler web benchmarks, while humans achieve 90% accuracy.

2 min readvia arxiv.org
research

Researchers model human intervention patterns to build more collaborative web agents

A new research paper introduces methods for predicting when humans will intervene in autonomous web agents by analyzing distinct interaction patterns. The work, which includes a dataset of 400 real-user web navigation trajectories with over 4,200 interleaved human-agent actions, shows that intervention-aware models improved agent usefulness by 26.5% in user studies.