
ElevenLabs and Google lead Artificial Analysis speech-to-text benchmark

TL;DR

Artificial Analysis has released an updated speech-to-text benchmark showing ElevenLabs and Google as top performers. The benchmark compares current speech recognition models on how accurately they transcribe spoken audio.


ElevenLabs and Google Lead Updated Speech-to-Text Benchmark

Artificial Analysis has released an updated speech-to-text benchmark, with ElevenLabs and Google emerging as the dominant performers in speech recognition accuracy and reliability.

The evaluation tests current speech-to-text systems on how well they convert spoken audio into written text. Both ElevenLabs and Google place in the top tier of the rankings.
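Artificial Analysis has not disclosed its scoring formula in the sources here, but speech-to-text benchmarks conventionally rank models by word error rate (WER): the word-level edit distance between a model's transcript and a reference transcript, divided by the number of reference words, so lower is better. A minimal sketch, with illustrative sample transcripts:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every remaining reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution over five reference words -> WER 0.2
print(word_error_rate("the quick brown fox jumps", "the quick brown fox jumped"))
```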

Benchmark Details

The updated benchmark from Artificial Analysis provides comparative metrics across speech recognition providers. However, specific accuracy scores, model names tested, dataset composition, and detailed performance differentials between ElevenLabs and Google have not been disclosed in available sources.

ElevenLabs, primarily known for text-to-speech synthesis, has expanded into speech recognition capabilities. Google's speech-to-text service has been a standard offering through Google Cloud and native Android/web integration for years.

What This Means

The benchmark reinforces that speech recognition quality remains concentrated among well-resourced companies with large training datasets. ElevenLabs' competitive showing in both directions of audio-text conversion (text-to-speech and speech-to-text) suggests the company is building comprehensive speech processing capabilities. Google's continued strength reflects its extensive audio data access through YouTube, Google Assistant, and cloud service users.

For developers and enterprises selecting speech-to-text providers, this benchmark offers independent evaluation data to guide integration decisions. The full benchmark details would be critical for understanding which system performs better in specific use cases (noise conditions, language support, latency requirements, accuracy thresholds).
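For a sense of what integrating one of the ranked providers involves, here is a minimal sketch using Google's Cloud Speech-to-Text Python client, following the library's standard quickstart pattern; the bucket URI, encoding, and sample rate are placeholder values, and credentials are assumed to be configured:

```python
# pip install google-cloud-speech; requires GOOGLE_APPLICATION_CREDENTIALS to be set.
from google.cloud import speech

client = speech.SpeechClient()

# Placeholder audio file; any 16 kHz FLAC in a readable GCS bucket works.
audio = speech.RecognitionAudio(uri="gs://my-bucket/sample.flac")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Each result carries ranked alternatives; take the top transcript.
    print(result.alternatives[0].transcript)
```

An evaluation harness along the benchmark's lines would run the same audio through each candidate provider and score the resulting transcripts with a metric like the WER function sketched above.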

Access the full Artificial Analysis benchmark for detailed scoring metrics and comparative analysis across all tested providers.

