google-research
1 article tagged with google-research
April 5, 2026
research
Google study: AI benchmarks need 10+ human raters per example, not standard 3-5
A Google Research and Rochester Institute of Technology study reveals that standard AI benchmarking practices using three to five human evaluators per test example systematically underestimate human disagreement and produce unreliable model comparisons. The researchers found that at least ten raters per example are needed for statistically reliable results, and that budget allocation between test examples and raters matters as much as total budget size.