Google AI Overviews reach 91% accuracy with Gemini 3, but 56% of correct answers lack verifiable sources

TL;DR

An independent study by AI startup Oumi found that Google's AI Overviews answered correctly 91% of the time with Gemini 3, up from 85% with Gemini 2, based on 4,326 searches using the SimpleQA benchmark. However, 56% of correct answers in Gemini 3 could not be verified through the linked sources—a significant increase from 37% in Gemini 2—and at Google's scale, a 9% error rate still translates to millions of wrong answers per hour.

Study Methodology and Results

AI startup Oumi, on behalf of the New York Times, analyzed 4,326 Google searches using the SimpleQA benchmark—an industry-standard test developed by OpenAI. The analysis ran in two phases: October 2025 with Gemini 2 and February 2026 after the upgrade to Gemini 3. Results showed accuracy climbing from 85% to 91%.

While this appears favorable, the absolute numbers reveal the scale of errors: at Google's volume of search traffic, a 9% error rate translates to millions of incorrect answers delivered hourly. The study did not determine whether users would receive better information through traditional search results or alternative sources.
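The arithmetic behind these figures can be sketched in a few lines of Python. This is a rough illustration, not part of the Oumi study: the function names are hypothetical, the per-model counts are back-calculated from the rounded percentages, and the volume estimate assumes the benchmark error rate carries over unchanged to live traffic.

```python
SAMPLE_SIZE = 4326  # searches analyzed, as reported in the study


def implied_counts(accuracy: float, total: int = SAMPLE_SIZE) -> tuple[int, int]:
    """Back-calculate (correct, wrong) answer counts from a rounded accuracy."""
    correct = round(total * accuracy)
    return correct, total - correct


def volume_for_wrong_answers(wrong_per_hour: float, error_rate: float) -> float:
    """Answers per hour needed to produce a given number of wrong answers."""
    return wrong_per_hour / error_rate


gemini_2 = implied_counts(0.85)
gemini_3 = implied_counts(0.91)
print(f"Gemini 2: about {gemini_2[0]} correct, {gemini_2[1]} wrong")
print(f"Gemini 3: about {gemini_3[0]} correct, {gemini_3[1]} wrong")

# Assumption: the 9% benchmark error rate applies as-is to real traffic.
needed = volume_for_wrong_answers(1_000_000, 0.09)
print(f"One million wrong answers per hour needs about {needed / 1e6:.1f} million answers per hour")
```

Under these assumptions, the 91% figure corresponds to roughly 389 wrong answers in the sample, and each million wrong answers per hour corresponds to roughly 11 million answers served per hour.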

The Verifiability Problem

A critical finding complicates the accuracy improvement: verifiability declined. Oumi checked whether the sources Google linked supported the answers provided. With Gemini 3, 56% of correct answers were not grounded in the cited sources, meaning the linked websites did not substantiate the information, compared to 37% with Gemini 2.
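Combining the accuracy and grounding figures shows why the decline matters. The sketch below is an illustration, not a calculation from the study. It assumes the 37% and 56% figures are shares of correct answers, as described above, and the function name is hypothetical.

```python
def split_correct_answers(accuracy: float, ungrounded_share: float) -> tuple[float, float]:
    """Split correct answers into (correct and grounded, correct but ungrounded).

    accuracy: share of all answers that were correct.
    ungrounded_share: share of correct answers not supported by linked sources.
    Both return values are shares of all answers.
    """
    ungrounded = accuracy * ungrounded_share
    grounded = accuracy - ungrounded
    return grounded, ungrounded


for model, accuracy, ungrounded_share in [
    ("Gemini 2", 0.85, 0.37),
    ("Gemini 3", 0.91, 0.56),
]:
    grounded, ungrounded = split_correct_answers(accuracy, ungrounded_share)
    print(f"{model}: {grounded:.1%} correct and grounded, {ungrounded:.1%} correct but ungrounded")
```

On these numbers, the share of all answers that are both correct and supported by the linked sources falls from about 54% with Gemini 2 to about 40% with Gemini 3, while correct-but-unsupported answers rise from about 31% to about 51%.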

The quality of cited sources raises additional concerns. Facebook and Reddit ranked as the second and fourth most-cited sources across 5,380 total sources. Facebook appeared in 5% of correct answers and 7% of incorrect ones. The study notes Google may have an incentive to favor sources less likely to pursue legal action over content usage.

The New York Times highlighted specific failures:

  • Classical Music Hall of Fame query: Google identified the correct listing site for Yo-Yo Ma but still claimed no record of his induction
  • North Carolina river question: Found the right tourism site but misread it, naming the Neuse River instead of the actual Little River
  • Bob Marley Museum opening date: Gave 1987 instead of 1986, aggregating conflicting data from Facebook, travel blogs, and Wikipedia

Google's Response

Google spokesperson Ned Adriance disputed the study, saying it contains "serious holes." He argued that SimpleQA—despite its name—was designed around particularly difficult questions where at least one AI model failed pre-screening, artificially inflating failure rates. He also noted the benchmark assumes scenarios without internet access, whereas Google's actual AI Overviews use live web search.

Google cited internal testing showing that Gemini 3.1 Pro achieved a 38 percentage point reduction in hallucination rate compared to the earlier Gemini 3 model tested in the study, which was likely a less capable Flash version. The company maintains that results improve significantly with web search access versus relying solely on model knowledge.

The Broader Concern: Web Ecosystem Impact

Beyond accuracy metrics, the study underscores a systemic issue: AI Overviews serve direct answers that discourage users from clicking through to external websites, diverting traffic from publishers. A 91% accuracy rate is likely high enough that most users skip verification altogether, further centralizing information discovery under Google's control.

Google has repeatedly disputed studies showing that AI Overviews reduce web traffic, while declining to publish its own traffic metrics. OpenAI showed greater transparency when launching ChatGPT's web features, initially stating concern for "the overall health of the ecosystem." According to the analysis, that position "quietly faded" as its search product scaled.

What This Means

The study reveals a widening gap between accuracy and trustworthiness. While Gemini 3 answers correctly more often, users have less ability to verify those answers independently, creating a paradox: better performance paired with reduced accountability. At scale, Google's error rate remains consequential, and the deteriorating verifiability suggests the company prioritizes answer generation over source integrity. This raises fundamental questions about whether centralized AI search serves users or simply optimizes for engagement and cost reduction.
