speech-recognition
12 articles tagged with speech-recognition
ServiceNow Releases First Code-Switching ASR Benchmark: ElevenLabs Scribe V2 Leads with Lowest WER Across Four Language Pairs
ServiceNow released AU-Harness, the first comprehensive benchmark for code-switched speech recognition in enterprise voice agents, testing seven ASR systems including ElevenLabs, Gemini, and AssemblyAI. The benchmark covers 918 utterances across Spanish-English, French-English, Canadian French-English, and German-English, measuring Word Error Rate (WER), Semantic WER (SWER), and Answer Error Rate (AER). ElevenLabs Scribe V2 achieved the lowest WER across all language pairs, followed closely by AssemblyAI Universal-3 Pro.
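For readers unfamiliar with the headline metric, here is a minimal sketch of how Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the ASR output, divided by the number of reference words. The function name `word_error_rate`, the lowercase-and-split normalization, and the sample sentences are assumptions for illustration only. The summary does not describe the benchmark's own text normalization, and SWER and AER are not covered here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: text is normalized by lowercasing and splitting on
    whitespace. Real benchmarks often normalize more aggressively.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched utterance: one substitution ("reunión" to
# "reunion") and one deletion ("for") against six reference words.
reference = "please schedule la reunión for monday"
hypothesis = "please schedule la reunion monday"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Because the denominator is the reference length, a single dropped or misspelled word in a short code-switched utterance moves the score noticeably, which is one reason a benchmark might report semantic metrics alongside plain WER.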
NVIDIA Releases Nemotron 3.5 ASR: 600M-Parameter Streaming Speech Model for 40 Languages
NVIDIA released Nemotron 3.5 ASR, a 600M-parameter speech-to-text model supporting 40 language-locales from a single checkpoint. The model delivers its final transcript 0.07 seconds after speech ends and ranks second in latency among streaming ASR models, according to Artificial Analysis benchmarks.
Microsoft releases MAI-Thinking-1, its first reasoning model with 35B parameters
Microsoft released seven AI models at Build 2026, headlined by MAI-Thinking-1, its first reasoning model with 35 billion parameters. The company claims the model matches Anthropic's Claude Opus 4.6 on SWE Bench Pro coding benchmarks and beats Sonnet 4.6 in blind tests.
Mistral AI Releases Voxtral: Apache 2.0 Speech Models with 32K Token Context at $0.001/Minute
Mistral AI released Voxtral, a family of open-source speech understanding models available in 24B and 3B parameter variants under the Apache 2.0 license. The models support up to 32K tokens of context (30 minutes of audio for transcription, 40 minutes for understanding) and are priced at $0.001 per minute via API, less than half the cost of comparable proprietary systems according to Mistral.
OpenAI Makes Whisper Speech Recognition Available on OpenRouter at $0.006 per Minute
OpenAI's Whisper 1 automatic speech recognition model is now accessible through OpenRouter's API routing service. The model supports transcription and translation across 50+ languages from audio files up to 25 MB, priced at $0.006 per minute of audio.
Google Home April 2026 update reduces Gemini interruptions, improves speech recognition in noisy environments
Google Home's April 2026 update addresses Gemini voice assistant reliability issues. The update improves speech detection to reduce mid-sentence interruptions, speeds up responses to simple queries, and improves recognition of music playlists even when their names are misspoken or spoken in noisy environments.
Microsoft releases three in-house AI models for speech and images, signaling independence from OpenAI
Microsoft released public preview versions of three proprietary AI models: MAI-Transcribe-1 for speech recognition across 25 languages at 50% lower GPU cost than alternatives, MAI-Voice-1 for speech synthesis generating 60 seconds of audio in under a second, and MAI-Image-2 for text-to-image generation. The models are available exclusively through Microsoft Azure AI Foundry and already power Copilot, Bing, and PowerPoint.
Alibaba's Qwen3.5-Omni learns to write code from speech and video without explicit training
Alibaba has released Qwen3.5-Omni, an omnimodal model handling text, images, audio, and video with a 256,000-token context window. The model reportedly outperforms Google's Gemini 3.1 Pro on audio tasks with support for 74 languages in speech recognition, a 6x increase from its predecessor. It also shows an unexpected emergent capability: writing working code from spoken instructions and video input, something the team did not explicitly train it to do.
Cohere releases 2B open-source speech model with 5.42% word error rate
Cohere has released Transcribe, a 2-billion-parameter open-source automatic speech recognition model that the company claims tops the Hugging Face Open ASR Leaderboard with a 5.42% word error rate. The model supports 14 languages, is available under the Apache 2.0 license, and outperforms OpenAI's Whisper Large v3 and competing models on both accuracy and throughput metrics.
IBM releases Granite 4.0 1B Speech: multilingual model for edge devices
IBM has released Granite 4.0 1B Speech, a 1-billion-parameter multilingual speech model designed for edge deployment. The model supports multiple languages and is optimized for devices with limited computational resources.
ElevenLabs and Google lead Artificial Analysis speech-to-text benchmark
Artificial Analysis has released an updated speech-to-text benchmark showing ElevenLabs and Google as top performers. The benchmark provides comparative analysis of current speech recognition systems across multiple models.
Apple Research Identifies 'Text-Speech Understanding Gap' Limiting LLM Speech Performance
Apple researchers have identified a fundamental limitation in speech-adapted large language models: they consistently underperform their text-based counterparts on language understanding tasks. The team terms this the 'text-speech understanding gap' and documents that speech-adapted LLMs lag behind both their original text versions and cascaded speech-to-text pipelines.