
OpenAI's GPT 5.4 ties Gemini 3.1 Pro at 72.4% on Google's Android coding benchmark

TL;DR

Google's Android Bench, a benchmark measuring AI model performance on Android app development tasks, shows OpenAI's GPT 5.4 and Google's Gemini 3.1 Pro Preview tied at 72.4% in its April 2026 update. OpenAI's GPT 5.3-Codex ranks third at 67.7%, and Anthropic's Claude Opus 4.6 fourth at 66.6%.

Google's Android Bench—introduced in March 2026 as a resource for evaluating AI models in Android app development—released its first update today, showing OpenAI's latest model matching Google's flagship offering.

Benchmark Scores

GPT 5.4 and Gemini 3.1 Pro Preview both scored 72.4%, the highest on the list. OpenAI's GPT 5.3-Codex follows at 67.7%. Anthropic's Claude Opus 4.6 ranks fourth at 66.6%.

Complete April 2026 rankings:

  • GPT 5.4: 72.4% (new)
  • Gemini 3.1 Pro Preview: 72.4%
  • GPT 5.3-Codex: 67.7% (new)
  • Claude Opus 4.6: 66.6%
  • GPT-5.2 Codex: 62.5%
  • Claude Opus 4.5: 61.9%
  • Gemini 3 Pro Preview: 60.4%
  • Claude Sonnet 4.6: 58.4%
  • Claude Sonnet 4.5: 54.2%
  • Gemini 3 Flash Preview: 42.0%
  • Gemini 2.5 Flash: 16.1%

Methodology

Google's evaluation framework assesses model capabilities across Android development essentials: Jetpack Compose for UI construction, Coroutines and Flows for asynchronous programming, Room for data persistence, and Hilt for dependency injection. Testing of OpenAI's models occurred in mid-March 2026, prior to their public release this week.
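
To make the covered areas concrete, here is a minimal Kotlin sketch touching all four: a Room entity and DAO exposing a Flow, a Hilt module and injected ViewModel, and a Jetpack Compose screen collecting that Flow as state. It is an illustration only; the Note/NoteDao names and the wiring are assumptions for the example, not actual Android Bench prompts or Google's test harness.

    // Hypothetical example of the Android stack the benchmark covers;
    // not taken from Android Bench itself.
    import android.content.Context
    import androidx.compose.foundation.lazy.LazyColumn
    import androidx.compose.foundation.lazy.items
    import androidx.compose.material3.Text
    import androidx.compose.runtime.Composable
    import androidx.compose.runtime.getValue
    import androidx.lifecycle.ViewModel
    import androidx.lifecycle.compose.collectAsStateWithLifecycle
    import androidx.room.*
    import dagger.Module
    import dagger.Provides
    import dagger.hilt.InstallIn
    import dagger.hilt.android.lifecycle.HiltViewModel
    import dagger.hilt.android.qualifiers.ApplicationContext
    import dagger.hilt.components.SingletonComponent
    import javax.inject.Inject
    import javax.inject.Singleton
    import kotlinx.coroutines.flow.Flow

    @Entity(tableName = "notes")
    data class Note(@PrimaryKey(autoGenerate = true) val id: Long = 0, val text: String)

    @Dao
    interface NoteDao {
        // Room persistence: the query re-emits through the Flow on every table change.
        @Query("SELECT * FROM notes ORDER BY id DESC")
        fun observeAll(): Flow<List<Note>>

        @Insert
        suspend fun insert(note: Note)  // Coroutines: suspending write off the main thread
    }

    @Database(entities = [Note::class], version = 1)
    abstract class AppDatabase : RoomDatabase() {
        abstract fun noteDao(): NoteDao
    }

    // Hilt dependency injection: wires the database and DAO into the object graph.
    @Module
    @InstallIn(SingletonComponent::class)
    object DataModule {
        @Provides @Singleton
        fun provideDatabase(@ApplicationContext context: Context): AppDatabase =
            Room.databaseBuilder(context, AppDatabase::class.java, "notes.db").build()

        @Provides
        fun provideNoteDao(db: AppDatabase): NoteDao = db.noteDao()
    }

    @HiltViewModel
    class NotesViewModel @Inject constructor(dao: NoteDao) : ViewModel() {
        val notes: Flow<List<Note>> = dao.observeAll()
    }

    @Composable
    fun NotesScreen(viewModel: NotesViewModel) {
        // Compose UI: collect the Flow as state; the list recomposes on database changes.
        val notes by viewModel.notes.collectAsStateWithLifecycle(initialValue = emptyList())
        LazyColumn { items(notes) { note -> Text(note.text) } }
    }

A plausible benchmark task would hand a model a brief like "add a notes list backed by Room, updating live via Flow" and score how correctly it produces code of this shape.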

Important Caveats

Google explicitly noted that benchmark results should not be treated as definitive. Real-world performance varies significantly based on workflow, pricing, integration ease, and specific use cases. The methodology measures controlled scenarios that may not reflect production development conditions.

Apart from OpenAI's two new entries, the ranking is unchanged from the initial late-February test run; no other models were added.

Context

Google positioned Android Bench as a tool to help developers "be more productive" and ultimately "deliver higher quality apps across the Android ecosystem." The benchmark represents Google's effort to provide transparent model comparisons for a specific, high-impact use case.

What This Means

The tie between GPT 5.4 and Gemini 3.1 Pro suggests convergence at the frontier for code-generation tasks. Developers choosing between them will likely base decisions on factors beyond benchmark scores: pricing, API latency, context window size, and ecosystem integration. The wide gap between the top-ranked models (72.4%) and older entries such as Gemini 2.5 Flash (16.1%) points to rapid improvement on specialized coding tasks. For teams standardized on Google or OpenAI infrastructure, the benchmark offers useful calibration without indicating a decisive winner.

