evaluation-metrics

2 articles tagged with evaluation-metrics

April 5, 2026
research

Google study: AI benchmarks need 10+ human raters per example, not standard 3-5

A study from Google Research and Rochester Institute of Technology finds that standard AI benchmarking practice, which uses three to five human evaluators per test example, systematically underestimates human disagreement and produces unreliable model comparisons. The researchers found that at least ten raters per example are needed for statistically reliable results, and that how a fixed evaluation budget is split between test examples and raters matters as much as the total budget.
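The statistical intuition is easy to see in a toy simulation: when only a handful of raters label a genuinely contested example, the panel often looks unanimous by chance, so measured disagreement comes out too low. The sketch below illustrates this under assumed numbers (a Beta(2,2) distribution of per-example acceptance probabilities and "panel not unanimous" as the disagreement criterion); it is not the paper's methodology.

```python
# Toy simulation (illustrative assumptions, not the study's setup):
# small rater panels make contested examples look unanimous by chance,
# so the observed disagreement rate is underestimated.
import numpy as np

rng = np.random.default_rng(0)

def observed_disagreement_rate(n_examples: int, raters_per_example: int) -> float:
    # Assumed latent model: each example has a probability p that a human
    # rater marks the model output as acceptable.
    p = rng.beta(2, 2, size=n_examples)           # many contested examples
    votes = rng.binomial(raters_per_example, p)   # "acceptable" votes per example
    # Count an example as showing disagreement if the panel is not unanimous.
    return float(np.mean((votes > 0) & (votes < raters_per_example)))

for k in (3, 5, 10, 20):
    rate = observed_disagreement_rate(10_000, k)
    print(f"{k:2d} raters/example -> observed disagreement {rate:.2f}")
```

With this setup, the observed disagreement rate climbs steadily from 3 to 10 to 20 raters, even though the underlying population of examples never changes; the small panels are simply missing most of the disagreement.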

March 11, 2026
research

Half of AI code passing SWE-bench would be rejected by real developers, METR study finds

A study by the research organization METR found that approximately 50% of AI-generated code solutions that pass the widely used SWE-bench benchmark would be rejected by actual project maintainers. The finding exposes a significant gap between industry-standard code-generation benchmarks and real-world code review standards.