LLM News

Every LLM release, update, and milestone.

benchmark

WebDS benchmark reveals a gap of roughly 70 percentage points between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark, comprising 870 tasks across 29 websites. Current state-of-the-art LLM agents succeed on only 15-20% of these complex, multi-step data acquisition and analysis tasks, while humans reach approximately 90% accuracy, a gap of roughly 70 to 75 percentage points.

2 min read · via arxiv.org