
Google study: AI benchmarks need 10+ human raters per example, not standard 3-5

TL;DR

A Google Research and Rochester Institute of Technology study reveals that standard AI benchmarking practices using three to five human evaluators per test example systematically underestimate human disagreement and produce unreliable model comparisons. The researchers found that at least ten raters per example are needed for statistically reliable results, and that budget allocation between test examples and raters matters as much as total budget size.



Researchers from Google Research and the Rochester Institute of Technology have found that the standard practice of using three to five human evaluators per test example in AI benchmarks is insufficient for reliable model comparisons and systematically ignores how humans disagree.

The Problem With Current Benchmarking

When AI models are evaluated on subjective tasks—toxicity detection, chatbot safety, cultural offensiveness—human raters typically score each example. The current standard: collect three to five ratings, pick a majority-vote "correct" answer, and move on. This approach discards information about human disagreement entirely.

The study shows that two examples receiving the same "Toxic" label via majority vote can have vastly different underlying distributions of human opinion. Standard benchmarks treat these identically, losing crucial nuance about task difficulty and genuine disagreement.
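The point can be made concrete with a small sketch. The ratings below are hypothetical, not taken from the study: both examples get the same majority-vote label, but a simple disagreement measure (Shannon entropy of the rating distribution) separates them.

```python
import math
from collections import Counter

def majority_label(ratings):
    """Return the most common label, as standard benchmarks do."""
    return Counter(ratings).most_common(1)[0][0]

def disagreement_entropy(ratings):
    """Shannon entropy of the rating distribution (0 = unanimous)."""
    counts = Counter(ratings)
    n = len(ratings)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Two hypothetical examples: same majority label, very different consensus.
near_unanimous = ["Toxic"] * 9 + ["Not toxic"]
split_opinion = ["Toxic"] * 6 + ["Not toxic"] * 4

assert majority_label(near_unanimous) == majority_label(split_opinion) == "Toxic"
print(disagreement_entropy(near_unanimous))  # low: raters mostly agree
print(disagreement_entropy(split_opinion))   # high: genuine disagreement
```

A majority-vote-only benchmark records both items as an identical "Toxic" ground truth; the entropy value is exactly the information it throws away.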

Key Findings

The research team built a simulator replicating human rating patterns across five real datasets covering toxicity, chatbot safety, and cross-cultural offensiveness. They tested thousands of budget allocations to determine which conditions reliably detected performance differences between models.

Critical threshold: Using fewer than ten raters per example often fails to produce reproducible model comparisons. For statistically reliable results that capture the range of human opinion, the study indicates that more than ten raters per example are generally needed.
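Why small panels fail to reproduce can be illustrated with a Monte Carlo sketch (this is an illustration of the statistical effect, not the study's simulator; the 65% toxic-vote rate is an assumed value for an ambiguous example):

```python
import random

def simulate_reproducibility(num_raters, p_toxic=0.65, trials=10_000, seed=0):
    """Estimate how often two independent rater panels of the same size
    yield the same majority-vote label for one ambiguous example.
    p_toxic is a hypothetical population vote rate, not from the study."""
    rng = random.Random(seed)

    def majority_is_toxic():
        votes = sum(rng.random() < p_toxic for _ in range(num_raters))
        return votes * 2 > num_raters  # strict majority; ties count as not toxic

    agree = sum(majority_is_toxic() == majority_is_toxic() for _ in range(trials))
    return agree / trials

for n in (3, 5, 10, 20):
    print(f"{n:>2} raters: panels agree {simulate_reproducibility(n):.0%} of the time")
```

With a genuinely contested example, small panels frequently flip labels between runs, which is exactly the reproducibility failure the study attributes to the three-to-five-rater standard.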

Budget efficiency: Reliable results can often be achieved with approximately 1,000 total annotations—but only if budget allocation between test examples and raters is optimized. Poor allocation produces unreliable conclusions even with substantially larger budgets.
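To see what "allocation" means concretely, the same annotation budget can be split in very different ways. A minimal sketch (the rater counts shown are illustrative choices, not the study's recommended designs):

```python
def allocation_options(total_budget, rater_counts=(3, 5, 10, 20)):
    """For a fixed annotation budget, list (examples, raters_per_example)
    splits with roughly the same total cost. The study's point is that
    these equal-cost designs are not equally reliable."""
    return [(total_budget // r, r) for r in rater_counts if total_budget // r > 0]

for examples, raters in allocation_options(1000):
    print(f"{examples} examples x {raters} raters = {examples * raters} annotations")
```

With roughly 1,000 annotations, 333 examples at 3 raters each and 100 examples at 10 raters each cost about the same; which one yields reliable conclusions depends on the metric being measured.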

One-Size-Fits-All Doesn't Work

The study's most important finding: there is no universal rater-to-example ratio. The optimal strategy depends entirely on what you're measuring.

For accuracy metrics (majority-vote agreement): Many examples with few raters each. Extra raters provide minimal additional signal when you only care about the most common answer.

For distribution-aware metrics (capturing full range of human responses): Fewer examples but significantly more raters per item. This is the only way to reliably measure how much evaluators agree or disagree.

Counterintuitively, distribution-aware metrics also required the smallest overall budget to produce reliable results in the experiments.
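The two metric families can be contrasted in a short sketch. All scores and labels below are hypothetical, and the distance function is a simple illustrative choice, not the metric used in the study:

```python
def majority_accuracy(model_labels, human_majorities):
    """Accuracy-style metric: does the model match the majority-vote label?"""
    hits = sum(m == h for m, h in zip(model_labels, human_majorities))
    return hits / len(model_labels)

def mean_abs_dist(model_probs, human_probs):
    """Distribution-aware metric: mean absolute gap between the model's
    toxicity score and the fraction of raters who voted 'Toxic'."""
    return sum(abs(m - h) for m, h in zip(model_probs, human_probs)) / len(model_probs)

human_probs = [0.9, 0.55]            # estimated from many raters per example
human_majorities = ["Toxic", "Toxic"]

# A model can ace the majority metric while missing the disagreement entirely:
print(majority_accuracy(["Toxic", "Toxic"], human_majorities))  # perfect score
print(mean_abs_dist([0.99, 0.98], human_probs))  # large gap on the contested item
```

Note that estimating `human_probs` at all requires many raters per example; with only three votes, the fraction can only take the values 0, 1/3, 2/3, or 1, which is why distribution-aware evaluation demands deeper panels.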

What This Means

The study directly challenges widespread benchmarking methodology across AI research. If current evaluation practices systematically ignore human disagreement, published model comparisons may be less reliable than claimed—especially on subjective tasks where human opinion naturally varies.

This has immediate implications: researchers designing new benchmarks should either increase rater counts substantially, reconsider what metrics they're optimizing for, or explicitly acknowledge when their evaluation methodology captures only majority opinion. For model developers, this suggests that leaderboard rankings based on thin human evaluation may need reinterpretation. The research doesn't disqualify current benchmarks but exposes a hidden cost of tight budgets: the loss of information that might change which model actually performs better.

