
ServiceNow Releases First Code-Switching ASR Benchmark: ElevenLabs Scribe V2 Leads with Lowest WER Across Four Language Pairs

TL;DR

ServiceNow released AU-Harness, the first comprehensive benchmark for code-switched speech recognition in enterprise voice agents, testing seven ASR systems including ElevenLabs, Gemini, and AssemblyAI. The benchmark covers 918 utterances across Spanish-English, French-English, Canadian French-English, and German-English, measuring Word Error Rate (WER), Semantic WER (SWER), and Answer Error Rate (AER). ElevenLabs Scribe V2 achieved the lowest WER across all language pairs, followed closely by AssemblyAI Universal-3 Pro.

June 9, 2026 · 7:50 PM · 2 min read

ServiceNow Releases First Code-Switching ASR Benchmark

ServiceNow has released AU-Harness, the first comprehensive benchmark for evaluating how automatic speech recognition (ASR) systems handle code-switched speech, in which bilingual speakers switch between languages mid-conversation. The benchmark addresses a gap in enterprise voice agent capabilities, one that matters because over half the world's population speaks multiple languages.

The Dataset

The benchmark contains 918 code-switched utterances across four language pairs:

Spanish-English: 259 records
French-English: 298 records
Canadian French-English: 188 records
German-English: 173 records

All utterances simulate real-world IT support and HR interactions, including password resets, VPN access requests, benefits inquiries, and device troubleshooting. ServiceNow generated code-switched text using GPT-5, synthesized audio with ElevenLabs Multilingual V2, and validated each utterance through native speaker linguists.

The data uses the non-English language as the matrix language (the grammatical frame of each sentence), with English segments of varying lengths embedded in it. Utterances range from 12 to 40 words and contain at least three switchable content words.

Methodology

ServiceNow evaluated seven ASR systems using three metrics:

Word Error Rate (WER): Standard transcription accuracy
Semantic WER (SWER): Rate of semantically meaningful errors, judged by Gemma-4-31B
Answer Error Rate (AER): Whether transcription errors prevent correct answers to three comprehension questions per utterance
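
To make the first and third metrics concrete, here is a minimal Python sketch. It assumes WER is the word-level edit distance divided by the reference length (the standard definition) and that AER is the share of comprehension questions answered incorrectly. The function names, the lowercase-and-split normalization, and the sample utterance are illustrative assumptions, not taken from the benchmark's code.

```python
from typing import Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate as a fraction of reference words (assumed normalization)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


def answer_error_rate(answers_correct: Sequence[bool]) -> float:
    """Share of comprehension questions answered incorrectly (assumed definition)."""
    if not answers_correct:
        raise ValueError("at least one question is required")
    return sum(not ok for ok in answers_correct) / len(answers_correct)


# Made-up example: one misspelled word and one dropped word out of seven.
reference = "necesito un password reset para mi laptop"
hypothesis = "necesito un pasword reset para laptop"
print(round(wer(reference, hypothesis), 3))
```

A transcript can score a nonzero WER here and still support all three answers, which is the kind of divergence the semantic metrics are meant to capture.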

The models tested:

AssemblyAI Universal-3 Pro
Deepgram Nova-3 Multilingual
ElevenLabs Scribe V2
Google Gemini 3 Flash
Mistral Voxtral Small 24B-2507
Nvidia Parakeet TDT 0.6b V3
OpenAI Whisper Large V3 Turbo

Results

WER Rankings: ElevenLabs Scribe V2 achieved the lowest WER across all four language pairs. AssemblyAI Universal-3 Pro tied on Spanish-English and trailed by 0.02-0.13 WER points on the other pairs. Google Gemini 3 Flash ranked third, 0.12-0.14 points behind the leaders.

Deepgram Nova-3, Mistral Voxtral, and Nvidia Parakeet occupied middle ranks. OpenAI Whisper Large V3 Turbo performed worst with WER ranging from 0.16 to 0.61—ServiceNow attributes this to Whisper defaulting to translation rather than transcription when called without explicit language parameters on code-switched audio.
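
If the explanation above holds, pinning the task and language explicitly is one way to keep Whisper in transcription mode. The sketch below assumes the open-source `openai-whisper` Python package, whose `transcribe()` method accepts `language` and `task` decoding options. Using the matrix language as the language hint is an assumption, as are the helper names; the article does not describe how the benchmark invoked Whisper.

```python
def build_decode_options(matrix_language: str) -> dict:
    """Decoding options that request transcription instead of translation."""
    return {"language": matrix_language, "task": "transcribe"}


def transcribe_code_switched(audio_path: str, matrix_language: str) -> str:
    """Transcribe one audio file with the task and language set explicitly."""
    import whisper  # third-party package: pip install openai-whisper

    model = whisper.load_model("large-v3-turbo")
    result = model.transcribe(audio_path, **build_decode_options(matrix_language))
    return result["text"]


# Options for a Spanish-English utterance (Spanish as the matrix language).
print(build_decode_options("es"))
```

Whether a single language hint is enough for audio that mixes two languages is not something this sketch settles; it only shows where the parameters go.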

Semantic Performance: For meaning-preservation metrics (SWER and AER), Scribe V2 maintained first place. However, Gemini 3 Flash consistently outperformed AssemblyAI on AER despite lower raw transcription accuracy, pushing AssemblyAI to third. ServiceNow attributes Gemini's advantage on semantic metrics to its optimization as a Large Audio Language Model (LALM) for language understanding and reasoning.

What This Means

This benchmark reveals that code-switching performance varies significantly by model and language pair. The 0.45-point gap between Whisper and Scribe V2 demonstrates that not all ASR systems are production-ready for bilingual enterprise deployments.

The divergence between transcription accuracy (WER) and semantic accuracy (AER) shows that raw WER alone is insufficient for evaluating enterprise voice agents. Models optimized for language understanding can preserve meaning despite higher word-level error rates. This is critical for downstream tasks like ticket routing and policy questions, where semantic accuracy matters more than perfect transcription.

ServiceNow has open-sourced the AU-Harness benchmark and dataset, providing the first standardized evaluation framework for code-switched speech in enterprise settings. The benchmark is available on Hugging Face.

Source: huggingface.co

