LLM News

Every LLM release, update, and milestone.

research

First benchmark for personalized deep research agents reveals gaps in current AI systems

Researchers introduced PDR-Bench, the first benchmark specifically designed to evaluate personalization in Deep Research Agents (DRAs). The benchmark pairs 50 research tasks across 10 domains with 25 authentic user profiles, creating 250 realistic queries that expose current limitations in how AI systems adapt to individual user contexts.

research

Researchers expose 'preference leakage' bias in LLM judging systems

Researchers have identified a contamination problem called preference leakage in LLM-as-a-judge evaluation systems, in which judges systematically favor data generated by related models. The bias arises when the judge LLM is the same as the generator, inherits from it, or belongs to the same model family, which makes it harder to detect than previously documented LLM evaluation biases.

research

Study questions whether OCR is still necessary for document extraction with modern MLLMs

A large-scale benchmarking study finds that modern multimodal large language models (MLLMs) can extract information from business documents nearly as well as traditional OCR+MLLM pipelines. The research introduces an automated error analysis framework and suggests that careful schema design and prompt engineering can further close the performance gap.

Google DeepMind argues chatbot ethics require same rigor as coding benchmarks

Google DeepMind is pushing for the moral behavior of large language models to be evaluated with the same technical rigor applied to coding and math benchmarks. As LLMs take on roles such as companions, therapists, and medical advisors, the research group argues that current evaluation standards are insufficient.