
OpenAI's GPT 5.4 ties Gemini 3.1 Pro at 72.4% on Google's Android coding benchmark

TL;DR

Google's Android Bench, a benchmark measuring AI model performance on Android app development tasks, shows OpenAI's GPT 5.4 and Google's Gemini 3.1 Pro Preview tied at 72.4% in its April 2026 update. OpenAI's GPT 5.3-Codex ranks third at 67.7%, and Anthropic's Claude Opus 4.6 fourth at 66.6%.

Google's Android Bench—introduced in March 2026 as a resource for evaluating AI models in Android app development—released its first update today, showing OpenAI's latest model matching Google's flagship offering.

Benchmark Scores

GPT 5.4 and Gemini 3.1 Pro Preview both scored 72.4%, the highest on the list. OpenAI's GPT 5.3-Codex follows at 67.7%. Anthropic's Claude Opus 4.6 ranks fourth at 66.6%.

Complete April 2026 rankings:

  • GPT 5.4: 72.4% (new)
  • Gemini 3.1 Pro Preview: 72.4%
  • GPT 5.3-Codex: 67.7% (new)
  • Claude Opus 4.6: 66.6%
  • GPT-5.2 Codex: 62.5%
  • Claude Opus 4.5: 61.9%
  • Gemini 3 Pro Preview: 60.4%
  • Claude Sonnet 4.6: 58.4%
  • Claude Sonnet 4.5: 54.2%
  • Gemini 3 Flash Preview: 42.0%
  • Gemini 2.5 Flash: 16.1%

Methodology

Google's evaluation framework assesses model capabilities across Android development essentials: Jetpack Compose for UI construction, Coroutines and Flows for asynchronous programming, Room for data persistence, and Hilt for dependency injection. Testing of OpenAI's models occurred in mid-March 2026, prior to their public release this week.
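
To make the covered areas concrete, here is a minimal Kotlin sketch touching all four: a Room entity and DAO exposing a Flow, a Hilt module and injected ViewModel, and a Jetpack Compose screen collecting that Flow as state. It is an illustration only; the Note/NoteDao names and the wiring are assumptions for the example, not actual Android Bench prompts or Google's test harness.

    // Hypothetical example of the Android stack the benchmark covers;
    // not taken from Android Bench itself.
    import android.content.Context
    import androidx.compose.foundation.lazy.LazyColumn
    import androidx.compose.foundation.lazy.items
    import androidx.compose.material3.Text
    import androidx.compose.runtime.Composable
    import androidx.compose.runtime.getValue
    import androidx.lifecycle.ViewModel
    import androidx.lifecycle.compose.collectAsStateWithLifecycle
    import androidx.room.*
    import dagger.Module
    import dagger.Provides
    import dagger.hilt.InstallIn
    import dagger.hilt.android.lifecycle.HiltViewModel
    import dagger.hilt.android.qualifiers.ApplicationContext
    import dagger.hilt.components.SingletonComponent
    import javax.inject.Inject
    import javax.inject.Singleton
    import kotlinx.coroutines.flow.Flow

    @Entity(tableName = "notes")
    data class Note(@PrimaryKey(autoGenerate = true) val id: Long = 0, val text: String)

    @Dao
    interface NoteDao {
        // Room persistence: the query re-emits through the Flow on every table change.
        @Query("SELECT * FROM notes ORDER BY id DESC")
        fun observeAll(): Flow<List<Note>>

        @Insert
        suspend fun insert(note: Note)  // Coroutines: suspending write off the main thread
    }

    @Database(entities = [Note::class], version = 1)
    abstract class AppDatabase : RoomDatabase() {
        abstract fun noteDao(): NoteDao
    }

    // Hilt dependency injection: wires the database and DAO into the object graph.
    @Module
    @InstallIn(SingletonComponent::class)
    object DataModule {
        @Provides @Singleton
        fun provideDatabase(@ApplicationContext context: Context): AppDatabase =
            Room.databaseBuilder(context, AppDatabase::class.java, "notes.db").build()

        @Provides
        fun provideNoteDao(db: AppDatabase): NoteDao = db.noteDao()
    }

    @HiltViewModel
    class NotesViewModel @Inject constructor(dao: NoteDao) : ViewModel() {
        val notes: Flow<List<Note>> = dao.observeAll()
    }

    @Composable
    fun NotesScreen(viewModel: NotesViewModel) {
        // Compose UI: collect the Flow as state; the list recomposes on database changes.
        val notes by viewModel.notes.collectAsStateWithLifecycle(initialValue = emptyList())
        LazyColumn { items(notes) { note -> Text(note.text) } }
    }

A plausible benchmark task would hand a model a brief like "add a notes list backed by Room, updating live via Flow" and score how correctly it produces code of this shape.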

Important Caveats

Google explicitly noted that benchmark results should not be treated as definitive. Real-world performance varies significantly based on workflow, pricing, integration ease, and specific use cases. The methodology measures controlled scenarios that may not reflect production development conditions.

Apart from OpenAI's two new entries, the ranking is unchanged from the initial late-February test run; no other models were added.

Context

Google positioned Android Bench as a tool to help developers "be more productive" and ultimately "deliver higher quality apps across the Android ecosystem." The benchmark represents Google's effort to provide transparent model comparisons for a specific, high-impact use case.

What This Means

The tie between GPT 5.4 and Gemini 3.1 Pro suggests convergence at the frontier for code-generation tasks. Developers choosing between them will likely base decisions on factors beyond benchmark scores: pricing, API latency, context window size, and ecosystem integration. The wide gap between the top-ranked models (72.4%) and older entries such as Gemini 2.5 Flash (16.1%) points to rapid improvement on specialized coding tasks. For teams standardized on Google or OpenAI infrastructure, the benchmark offers useful calibration without indicating a decisive winner.

