ServiceNow Releases First Code-Switching ASR Benchmark: ElevenLabs Scribe V2 Leads with Lowest WER Across Four Language Pairs

TL;DR

ServiceNow released AU-Harness, the first comprehensive benchmark for code-switched speech recognition in enterprise voice agents, testing seven ASR systems including ElevenLabs, Gemini, and AssemblyAI. The benchmark covers 918 utterances across Spanish-English, French-English, Canadian French-English, and German-English, measuring Word Error Rate (WER), Semantic WER (SWER), and Answer Error Rate (AER). ElevenLabs Scribe V2 achieved the lowest WER across all language pairs, followed closely by AssemblyAI Universal-3 Pro.

ServiceNow has released AU-Harness, the first comprehensive benchmark for evaluating how automatic speech recognition (ASR) systems handle code-switched speech, in which bilingual speakers switch between languages mid-conversation. The benchmark addresses a gap in enterprise voice agent capabilities that matters because over half the world's population speaks more than one language.

The Dataset

The benchmark contains 918 code-switched utterances across four language pairs:

  • Spanish-English: 259 records
  • French-English: 298 records
  • Canadian French-English: 188 records
  • German-English: 173 records
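As a quick sanity check on the composition above, the following sketch tallies the four per-pair counts and prints each pair's share of the dataset. The counts come from the list above; the variable names are illustrative only.

```python
# Per-pair record counts as reported in the article.
counts = {
    "Spanish-English": 259,
    "French-English": 298,
    "Canadian French-English": 188,
    "German-English": 173,
}

# The four pairs should add up to the 918 utterances quoted for the benchmark.
total = sum(counts.values())

# Share of the dataset contributed by each language pair.
shares = {pair: n / total for pair, n in counts.items()}

print(f"total utterances: {total}")
for pair, share in shares.items():
    print(f"{pair}: {share:.1%}")
```

The pairs are not evenly sized, so a WER pooled over all 918 utterances weights French-English most heavily; reporting per-pair scores, as the article does, avoids that skew.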

All utterances simulate real-world IT support and HR interactions, including password resets, VPN access requests, benefits inquiries, and device troubleshooting. ServiceNow generated the code-switched text using GPT-5, synthesized audio with ElevenLabs Multilingual V2, and had native-speaker linguists validate each utterance.

The data uses the non-English language as the matrix language (the grammatical frame of the sentence), with English segments of varying length embedded in it. Utterances range from 12 to 40 words and contain at least three switchable content words.

Methodology

ServiceNow evaluated seven ASR systems using three metrics:

  1. Word Error Rate (WER): Standard transcription accuracy
  2. Semantic WER (SWER): Rate of semantically meaningful errors, judged by Gemma-4-31B
  3. Answer Error Rate (AER): Whether transcription errors prevent correct answers to three comprehension questions per utterance
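For readers unfamiliar with the first metric, here is a minimal sketch of WER as it is conventionally defined: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The lowercasing and whitespace tokenization are assumptions made for illustration; the article does not describe the normalization AU-Harness applies, and the example sentence is invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented Spanish-English example: one misspelled word out of eight.
reference = "necesito hacer un password reset para mi laptop"
hypothesis = "necesito hacer un pasword reset para mi laptop"
print(word_error_rate(reference, hypothesis))  # 0.125
```

Because every wrong word counts the same, a misspelled filler word and a misrecognized product name contribute equally to WER, which is the limitation the two semantic metrics are meant to address.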

The models tested:

  • AssemblyAI Universal-3 Pro
  • Deepgram Nova-3 Multilang
  • ElevenLabs Scribe V2
  • Google Gemini 3 Flash
  • Mistral Voxtral Small 24B-2507
  • Nvidia Parakeet TDT 0.6b V3
  • OpenAI Whisper Large V3 Turbo

Results

WER Rankings: ElevenLabs Scribe V2 achieved the lowest WER across all four language pairs. AssemblyAI Universal-3 Pro tied on Spanish-English and trailed by 0.02-0.13 percentage points on other pairs. Google Gemini 3 Flash ranked third, falling 0.12-0.14 points behind the leaders.

Deepgram Nova-3, Mistral Voxtral, and Nvidia Parakeet occupied the middle ranks. OpenAI Whisper Large V3 Turbo performed worst, with WER ranging from 0.16 to 0.61. ServiceNow attributes this to Whisper defaulting to translation rather than transcription when it is called on code-switched audio without explicit language parameters.
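The Whisper behavior described above points to pinning the task and language explicitly. The sketch below assumes the open-source `openai-whisper` Python package; the article does not say which Whisper interface ServiceNow used. It also assumes the matrix language is the right language hint for code-switched audio. The helper names `build_decode_options` and `transcribe_file` are hypothetical, and nothing here is claimed to reproduce or improve on the benchmark's numbers.

```python
def build_decode_options(language_code: str) -> dict:
    """Decode options that request transcription in a fixed language.

    Assumption: the matrix language (e.g. "es", "fr", "de") is passed as the
    hint, since that is the dominant language of each utterance.
    """
    return {"language": language_code, "task": "transcribe"}


def transcribe_file(path: str, language_code: str,
                    model_name: str = "large-v3-turbo") -> str:
    """Transcribe one audio file with an explicit task and language."""
    # Imported lazily so the options helper can be used without the package.
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **build_decode_options(language_code))
    return result["text"]


print(build_decode_options("es"))
```

A call such as `transcribe_file("utterance.wav", "es")` would then request Spanish transcription rather than leaving the task and language to the model's defaults; the file name is a placeholder.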

Semantic Performance: For meaning-preservation metrics (SWER and AER), Scribe V2 maintained first place. However, Gemini 3 Flash consistently outperformed AssemblyAI on AER despite lower raw transcription accuracy, pushing AssemblyAI to third. ServiceNow attributes Gemini's advantage on semantic metrics to its optimization as a Large Audio Language Model (LALM) for language understanding and reasoning.

What This Means

This benchmark shows that code-switching performance varies significantly by model and language pair. The 0.45-point gap between Whisper and Scribe V2 demonstrates that not every ASR system is production-ready for bilingual enterprise deployments.

The divergence between transcription accuracy (WER) and semantic accuracy (AER) shows that raw WER alone is insufficient for evaluating enterprise voice agents. Models optimized for language understanding can preserve meaning despite more word-level errors. That matters for downstream tasks like ticket routing and policy questions, where semantic accuracy counts for more than perfect transcription.
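To make the AER side of that divergence concrete, here is a minimal sketch of how such a score could be aggregated. It assumes AER is the fraction of comprehension questions that cannot be answered correctly from the transcript, pooled over all utterances; the article states only that there are three questions per utterance, so the aggregation rule, the function name, and the sample judgments are all illustrative.

```python
from typing import Sequence


def answer_error_rate(judgments: Sequence[Sequence[bool]]) -> float:
    """Fraction of comprehension questions answered incorrectly.

    judgments[i][k] is True when question k about utterance i was answered
    correctly using only the ASR transcript. How each judgment is produced
    (for example by an LLM grader) is outside this sketch.
    """
    total = sum(len(per_utterance) for per_utterance in judgments)
    if total == 0:
        raise ValueError("at least one judged question is required")
    wrong = sum(1 for per_utterance in judgments
                for correct in per_utterance if not correct)
    return wrong / total


# Two invented utterances with three questions each.
judgments = [
    [True, True, True],   # transcript had word errors, but all answers survive
    [True, False, True],  # one error destroyed the answer to one question
]
print(answer_error_rate(judgments))
```

Under this definition a transcript can contain several word errors and still score an AER of zero, as long as none of them touches the information the questions ask about.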

ServiceNow has open-sourced the AU-Harness benchmark and dataset, providing the first standardized evaluation framework for code-switched speech in enterprise settings. The benchmark is available on Hugging Face.
