Gemini 3.5 Flash ranks 6th in Android coding benchmark at 3x cost of Gemini 3.1 Pro

TL;DR

Google's latest Android Bench results show Gemini 3.5 Flash ranking 6th with a 63.7% success rate while averaging $147.10 per benchmark run, roughly three times Gemini 3.1 Pro Preview's $47.90. The newer model used an average of 355.9 tokens per run versus 73.3 for its predecessor. GPT 5.5 leads the benchmark with a 74% success rate.

June 12, 2026 · 3:20 PM · 2 min read

Google's Gemini 3.5 Flash placed sixth in the company's Android Bench coding benchmark with a 63.7% success rate, while costing more per run than any other model listed here, including Google's own Gemini 3.1 Pro Preview.

Benchmark results

Android Bench measures model performance across 10 runs of Android coding tasks, scoring each model by the percentage of cases it solves. According to Google's latest results:

Top 5 performers:

GPT 5.5: 74% success rate, $134.20 per run, 64.7 tokens average
GPT 5.4: 72.4% success rate, $91.70 per run, 64.2 tokens average
Gemini 3.1 Pro Preview: 72.4% success rate, $47.90 per run, 73.3 tokens average
Claude Opus 4.7: 68.7% success rate, $124.30 per run, 90.0 tokens average
Claude Opus 4.6: 66.6% success rate, $84.40 per run, 69.5 tokens average

Gemini 3.5 Flash:

Score: 63.7% success rate
Average latency: 14.2 seconds
Average tokens: 355.9 per run
Average cost: $147.10 per run

The model scored about 9 percentage points lower than Gemini 3.1 Pro Preview (63.7% vs 72.4%) while costing more than three times as much per run ($147.10 vs $47.90) and consuming about 4.9 times as many tokens (355.9 vs 73.3).
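The comparison between the two Gemini models can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in this article; the variable names are illustrative.

```python
# Figures copied from the benchmark results quoted in this article.
flash = {"success": 63.7, "cost": 147.10, "tokens": 355.9}  # Gemini 3.5 Flash
pro = {"success": 72.4, "cost": 47.90, "tokens": 73.3}      # Gemini 3.1 Pro Preview

# Difference in success rate, in percentage points.
gap_points = pro["success"] - flash["success"]

# How many times more Flash cost and consumed per run.
cost_ratio = flash["cost"] / pro["cost"]
token_ratio = flash["tokens"] / pro["tokens"]

print(f"Success gap: {gap_points:.1f} percentage points")
print(f"Cost ratio:  {cost_ratio:.2f}x")
print(f"Token ratio: {token_ratio:.2f}x")
```

The exact gap is 8.7 percentage points, which the article rounds to 9.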

Cost and efficiency comparison

Gemini 3.5 Flash was positioned as a cheaper, faster alternative to Gemini 3.1 Pro. On Android Bench, however, it shows higher latency and significantly higher resource consumption. GPT 5.5 had a similar per-run cost ($134.20) while using roughly one-fifth as many tokens as Gemini 3.5 Flash (64.7 vs 355.9).
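One way to separate unit price from volume is to divide the per-run cost by the reported token figure. The sketch below does that for the two models compared above. It assumes the per-run cost scales with the token figure. The article does not state the unit of that figure, so the code treats it as an opaque "token unit".

```python
# (cost in USD per run, average tokens per run) as reported in the article.
# Assumption: the unit of the token figure is not stated in the source,
# so it is treated here as an opaque "token unit".
RUNS = {
    "Gemini 3.5 Flash": (147.10, 355.9),
    "GPT 5.5": (134.20, 64.7),
}


def cost_per_token_unit(cost_usd: float, tokens: float) -> float:
    """Return dollars spent per reported token unit."""
    return cost_usd / tokens


per_unit = {name: cost_per_token_unit(c, t) for name, (c, t) in RUNS.items()}
token_ratio = RUNS["Gemini 3.5 Flash"][1] / RUNS["GPT 5.5"][1]

for name, value in per_unit.items():
    print(f"{name}: ${value:.2f} per token unit")
print(f"Gemini 3.5 Flash used {token_ratio:.1f}x the tokens of GPT 5.5")
```

On these figures, Gemini 3.5 Flash works out cheaper per token unit than GPT 5.5, which suggests its high per-run cost comes from token volume rather than unit price.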

Open-weight models occupy the lower rankings, with DeepSeek V4 Pro offering the lowest cost at $13.70 per run with a 55.4% success rate.

Rounding out the top 10

GLM 5.1: 59.7% ($46.70)
Kimi K2.6: 58.6% ($42.50)
Claude Sonnet 4.6: 58.4% ($40.40)
DeepSeek V4 Pro: 55.4% ($13.70)
Claude Sonnet 4.5: 53.7% ($61.00)
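The success rates and per-run costs reported in this article can be compared on both axes at once. The sketch below finds the models that no other model beats on both success rate and cost, a standard Pareto-frontier filter. It is an illustration applied to the published figures, not part of Google's methodology, and the `Result`, `dominates`, and `pareto_frontier` names are hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    success: float  # percent of cases solved
    cost: float     # US dollars per run


# Success rate and cost per run as reported in the article.
RESULTS = [
    Result("GPT 5.5", 74.0, 134.20),
    Result("GPT 5.4", 72.4, 91.70),
    Result("Gemini 3.1 Pro Preview", 72.4, 47.90),
    Result("Claude Opus 4.7", 68.7, 124.30),
    Result("Claude Opus 4.6", 66.6, 84.40),
    Result("Gemini 3.5 Flash", 63.7, 147.10),
    Result("GLM 5.1", 59.7, 46.70),
    Result("Kimi K2.6", 58.6, 42.50),
    Result("Claude Sonnet 4.6", 58.4, 40.40),
    Result("DeepSeek V4 Pro", 55.4, 13.70),
    Result("Claude Sonnet 4.5", 53.7, 61.00),
]


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    at_least_as_good = a.success >= b.success and a.cost <= b.cost
    strictly_better = a.success > b.success or a.cost < b.cost
    return at_least_as_good and strictly_better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep only the results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


frontier = pareto_frontier(RESULTS)
for r in frontier:
    print(f"{r.name}: {r.success}% at ${r.cost:.2f} per run")
```

With these numbers, Gemini 3.5 Flash is not on the frontier: Gemini 3.1 Pro Preview, among others, scores higher at a lower cost.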

Google has not yet published benchmark scores for Claude Opus 4.8 or Fable 5. The company has also removed GPT 5.3 Codex from the rankings since the previous update.

What this means

This benchmark points to a performance-cost mismatch for Gemini 3.5 Flash in Android development tasks, at odds with its general positioning as an efficient alternative. The results suggest that model performance varies significantly by use case: Gemini 3.5 Flash may excel at general tasks, but Android coding appears to be a weak spot. For developers choosing coding models, the data shows GPT 5.5 and Gemini 3.1 Pro Preview delivering better value in this specific domain, with higher success rates at a lower per-run cost. The token consumption gap (355.9 vs 73.3 for Gemini 3.1 Pro Preview) may indicate optimization issues in how Gemini 3.5 Flash approaches Android development problems.
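One rough way to put a number on value in this domain is dollars spent per percentage point of success. The sketch below applies that metric to the three models discussed in this section. The metric is an assumption chosen for illustration; it is crude, because it treats every percentage point as equally valuable.

```python
# (success rate in percent, cost in USD per run) as reported in the article.
MODELS = {
    "GPT 5.5": (74.0, 134.20),
    "Gemini 3.1 Pro Preview": (72.4, 47.90),
    "Gemini 3.5 Flash": (63.7, 147.10),
}


def dollars_per_point(success_pct: float, cost_usd: float) -> float:
    """Illustrative value metric: cost per percentage point of success."""
    return cost_usd / success_pct


value = {name: dollars_per_point(s, c) for name, (s, c) in MODELS.items()}

# Lower is better: print the models from best to worst value.
for name in sorted(value, key=value.get):
    print(f"{name}: ${value[name]:.2f} per percentage point")
```

By this measure Gemini 3.1 Pro Preview comes out far ahead, with GPT 5.5 second and Gemini 3.5 Flash last.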

Source: 9to5google.com
