ObfusQAte framework reveals LLMs hallucinate when faced with obfuscated questions
Researchers have introduced ObfusQAte, a new benchmark framework designed to test the robustness of large language models on obfuscated factual questions. Evaluations with the framework show that leading LLMs fail and hallucinate at markedly higher rates when the same factual questions are rephrased with increasingly indirect and nuanced language.
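To make the evaluation idea concrete, here is a minimal sketch of how such a robustness check could be structured: score a model on a plain factual question and on progressively obfuscated rephrasings of it, then compare. The example item, the `query_model` stub, and the exact-match scoring are illustrative assumptions, not ObfusQAte's actual pipeline.

```python
# Minimal sketch of an obfuscation-robustness check for factual QA.
# All names and data here are hypothetical; this is not ObfusQAte's API.

from dataclasses import dataclass


@dataclass
class QAItem:
    original: str            # the plain factual question
    obfuscations: list[str]  # increasingly indirect rephrasings
    gold: str                # expected answer


def query_model(question: str) -> str:
    """Stand-in for a real LLM call; replace with an actual API client."""
    return "Paris"  # canned answer so the sketch runs on its own


def exact_match(prediction: str, gold: str) -> bool:
    """Crude scoring rule; real benchmarks typically use looser matching."""
    return prediction.strip().lower() == gold.strip().lower()


def robustness_report(item: QAItem) -> dict[str, bool]:
    """Score the model on the original question and each obfuscated variant."""
    results = {"original": exact_match(query_model(item.original), item.gold)}
    for i, variant in enumerate(item.obfuscations, start=1):
        results[f"obfuscation_{i}"] = exact_match(query_model(variant), item.gold)
    return results


if __name__ == "__main__":
    item = QAItem(
        original="What is the capital of France?",
        obfuscations=[
            "Which city serves as the seat of government for the country "
            "whose flag is blue, white, and red?",
            "Name the metropolis on the Seine that hosts the Élysée Palace.",
        ],
        gold="Paris",
    )
    print(robustness_report(item))
```

A per-item report like this makes the failure pattern visible: a model that answers the original question correctly but misses the obfuscated variants is relying on surface phrasing rather than the underlying fact.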