Google AI Overviews reach 91% accuracy with Gemini 3, but 56% of correct answers lack verifiable sources
An independent study by AI startup Oumi found that Google's AI Overviews answered correctly 91% of the time with Gemini 3, up from 85% with Gemini 2, based on 4,326 searches using the SimpleQA benchmark. However, 56% of Gemini 3's correct answers could not be verified through the linked sources, up sharply from 37% under Gemini 2. And at Google's scale, a 9% error rate still translates to millions of wrong answers per hour.
Study Methodology and Results
AI startup Oumi, on behalf of the New York Times, analyzed 4,326 Google searches using the SimpleQA benchmark—an industry-standard test developed by OpenAI. The analysis ran in two phases: October 2025 with Gemini 2 and February 2026 after the upgrade to Gemini 3. Results showed accuracy climbing from 85% to 91%.
While this appears favorable, the absolute numbers reveal the scale of errors: at Google's volume of search traffic, a 9% error rate translates to millions of incorrect answers delivered hourly. The study did not determine whether users would receive better information through traditional search results or alternative sources.
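To make the scale claim concrete, here is a back-of-envelope sketch. The daily search volume and the share of searches that trigger an AI Overview are illustrative assumptions, not figures from the study; only the 9% error rate comes from Oumi's results.

```python
# Back-of-envelope estimate of hourly wrong answers from AI Overviews.
# ASSUMPTIONS (not from the study): roughly 14 billion Google searches
# per day, with an AI Overview shown on about 15% of them. Only the 9%
# error rate is taken from Oumi's Gemini 3 results.
searches_per_day = 14e9   # assumed daily search volume
overview_share = 0.15     # assumed fraction of searches showing an AI Overview
error_rate = 0.09         # from the study: 9% of answers were wrong

wrong_per_hour = searches_per_day * overview_share * error_rate / 24
print(f"~{wrong_per_hour:,.0f} wrong answers per hour")
# Prints ~7,875,000 under these assumptions
```

Even halving both assumed inputs leaves the figure around two million wrong answers per hour, so the "millions per hour" framing holds across a wide range of assumptions.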
The Verifiability Problem
A critical finding cuts against the accuracy improvement: verifiability declined. Oumi checked whether Google's linked sources actually supported the answers provided. With Gemini 3, 56% of correct answers lacked grounding in the cited sources (the linked websites didn't substantiate the information), compared to 37% with Gemini 2.
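The check Oumi performed can be pictured as a grounding test: given an answer and the text of the page Google cited, does that page actually support the answer? The study does not publish its exact procedure, so the sketch below is a deliberately naive substring version with hypothetical example strings; a production pipeline would use an entailment model or human raters instead.

```python
# Naive grounding check: does the cited source text support the answer?
# Illustrative stand-in for Oumi's (unpublished) procedure; a realistic
# version would use textual entailment or human review, not substrings.

def is_grounded(answer: str, source_text: str) -> bool:
    """Return True if the answer string appears in the cited source."""
    return answer.lower() in source_text.lower()

# Hypothetical source text echoing the Bob Marley Museum example below:
source = "The Bob Marley Museum opened to the public in 1986."
print(is_grounded("1986", source))  # True: the source supports the answer
print(is_grounded("1987", source))  # False: the answer is ungrounded
```

Combining the two headline findings, only about 40% of all Gemini 3 answers (0.91 correct × 0.44 verifiable) were both right and backed by a source a reader could actually check.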
The quality of cited sources raises additional concerns. Of 5,380 total sources, Facebook and Reddit ranked as the second and fourth most-cited. Facebook appeared in 5% of correct answers and 7% of incorrect ones. The study notes Google may have an incentive to favor sources that are less likely to pursue legal action over content usage.
The New York Times highlighted specific failures:
- Classical Music Hall of Fame query: Google identified the correct listing site for Yo-Yo Ma but still claimed there was no record of his induction
- North Carolina river question: Found the right tourism site but misread it, naming the Neuse River instead of the actual Little River
- Bob Marley Museum opening date: Gave 1987 instead of 1986, aggregating conflicting data from Facebook, travel blogs, and Wikipedia
Google's Response
Google spokesperson Ned Adriance disputed the study, saying it contains "serious holes." He argued that SimpleQA—despite its name—was designed around particularly difficult questions where at least one AI model failed pre-screening, artificially inflating failure rates. He also noted the benchmark assumes scenarios without internet access, whereas Google's actual AI Overviews use live web search.
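Adriance's pre-screening point describes an adversarial filtering step. A minimal sketch of that idea follows; the screening models, grading function, and example question are toy placeholders, not OpenAI's actual construction pipeline.

```python
# Sketch of the adversarial filtering Adriance describes: a candidate
# question is kept in the benchmark only if at least one screening model
# answers it incorrectly, which biases the set toward hard questions.

def keep_question(question, reference, screening_models, grade) -> bool:
    """Keep the question only if some screening model fails it."""
    return any(not grade(m(question), reference) for m in screening_models)

# Toy usage with stand-in "models" (one wrong, one right):
models = [lambda q: "1987", lambda q: "1986"]
grade = lambda pred, ref: pred == ref
print(keep_question("When did the Bob Marley Museum open?", "1986",
                    models, grade))  # True: question kept as "hard"
```

Under this construction, a benchmark-wide failure rate says more about the filter than about typical search queries, which is the core of Google's objection.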
Google cited internal testing showing that Gemini 3.1 Pro achieved a 38 percentage point reduction in hallucination rate compared to the model tested in the study, which was likely a less capable Gemini 3 Flash variant. The company maintains that results improve significantly with web search access versus relying solely on model knowledge.
The Broader Concern: Web Ecosystem Impact
Beyond accuracy metrics, the study underscores a systemic issue: AI Overviews serve direct answers that discourage users from clicking through to external websites, diverting traffic away from publishers. A 91% accuracy rate is likely high enough that most users skip verification altogether, further centralizing information discovery under Google's control.
Google has repeatedly denied studies showing AI Overviews reduce web traffic, while declining to publish its own traffic metrics. OpenAI showed greater transparency when launching ChatGPT's web features, initially stating concern for "the overall health of the ecosystem"—a position that "quietly faded," according to the analysis, as its search product scaled.
What This Means
The study reveals a widening gap between accuracy and trustworthiness. While Gemini 3 answers correctly more often, users have less ability to verify those answers independently, creating a paradox: better performance paired with reduced accountability. At scale, Google's error rate remains consequential, and the deteriorating verifiability suggests the company prioritizes answer generation over source integrity. This raises fundamental questions about whether centralized AI search serves users or simply optimizes for engagement and cost reduction.
Related Articles
Google benchmarks AI models for Android development; names top performers
Google has completed benchmarking tests to evaluate which AI models perform best for Android app development. The company released results identifying top-performing models across coding tasks specific to the Android platform.
ARC-AGI-3 benchmark: frontier AI models score below 1%, humans solve all 135 tasks
The ARC Prize Foundation released ARC-AGI-3, an interactive benchmark requiring AI agents to explore environments, form hypotheses, and execute plans without instructions. All 135 environments were solved by untrained humans, yet frontier models—including Gemini 3.1 Pro Preview (0.37%), GPT 5.4 (0.26%), Opus 4.6 (0.25%), and Grok-4.20 (0.00%)—scored below 1%.
ElevenLabs and Google lead Artificial Analysis speech-to-text benchmark
Artificial Analysis has released an updated speech-to-text benchmark showing ElevenLabs and Google as top performers. The benchmark provides comparative analysis of current speech recognition systems across multiple models.
Nvidia claims 291 MLPerf wins with 288-GPU setup; AMD MI355X crosses 1M tokens/sec
MLCommons published MLPerf Inference v6.0 results on April 1, 2026, with Nvidia, AMD, and Intel each claiming top spots in different configurations. Nvidia's 288-GPU GB300-NVL72 system achieved 2.49 million tokens per second on DeepSeek-R1, while AMD's MI355X crossed one million tokens per second for the first time. Direct comparisons remain difficult as each chipmaker targets different market segments and benchmarks.