Cohere releases 2B open-source speech model with 5.42% word error rate
Cohere has released Transcribe, a 2-billion-parameter open-source automatic speech recognition model that the company claims tops the Hugging Face Open ASR Leaderboard with a 5.42% word error rate. The model supports 14 languages and is available under the Apache 2.0 license; per Cohere, it outperforms OpenAI's Whisper Large v3 and competing models on both accuracy and throughput.
Cohere has released Transcribe, a 2-billion-parameter open-source automatic speech recognition (ASR) model. According to the company, it achieves a 5.42% average word error rate (WER) on the Hugging Face Open ASR Leaderboard, claiming the top position ahead of OpenAI's Whisper Large v3, ElevenLabs Scribe v2, and Qwen3-ASR-1.7B.
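For context on the headline metric: word error rate is the standard ASR accuracy measure, computed as the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the model's output, divided by the number of reference words. A minimal sketch of the calculation (illustrative only, not Cohere's or the leaderboard's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One missed word out of six reference words -> WER of ~16.7%.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A 5.42% WER therefore means roughly one word-level error per 18 reference words, averaged across the leaderboard's test sets. Production evaluations typically also normalize casing and punctuation before scoring, which this sketch omits.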
Model specifications and performance
Transcribe supports 14 languages, including English, German, French, and Japanese. Cohere also claims the best throughput among similarly sized competitors, a critical metric for production deployments where both latency and accuracy matter.
The model is available for download from Hugging Face under the Apache 2.0 open-source license, making it freely usable for commercial and non-commercial applications. It can also be accessed through Cohere's API and the company's Model Vault platform for users who prefer cloud-based inference.
Deployment and integration plans
Cohere plans to integrate Transcribe into its North AI agent platform. The release positions Transcribe as a viable alternative to proprietary speech models for developers and organizations seeking open-source ASR capabilities.
The 2B parameter count represents a middle ground: larger than many mobile-optimized models but smaller than the largest research models, suggesting practical hardware requirements for deployment.
What this means
Cohere's Transcribe release adds competition to a speech recognition market where Whisper has dominated open-source discussions. The claimed 5.42% WER and Apache 2.0 licensing remove licensing friction for commercial applications. However, independent verification of leaderboard results is essential: published benchmarks sometimes reflect narrow test conditions rather than real-world performance across diverse audio conditions, accents, and domains. The model's multilingual support and throughput claims warrant direct comparison testing before large-scale adoption decisions.
Related Articles
Anthropic confirms leaked model represents major reasoning advance after security breach
A data breach at Anthropic exposed internal documents detailing an unreleased AI model the company describes as its most powerful to date. Anthropic confirmed it is already testing the model with select customers, claiming significant advances in reasoning, coding, and cybersecurity. The breach resulted from a misconfiguration in Anthropic's content management system that automatically made ~3,000 uploaded files publicly accessible.
Chroma releases Context-1, a 20B parameter retrieval agent for complex multi-hop search
Chroma has released Context-1, a 20B parameter Mixture of Experts model trained specifically for retrieval tasks that require multi-hop reasoning. The model decomposes complex queries into subqueries, performs parallel tool calls, and actively prunes its own context mid-search—achieving comparable performance to frontier models at a fraction of the cost and up to 10x faster inference speed.
Gemini 3.1 Flash Live scores 95.9% on Big Bench Audio, Google's fastest voice model
Google has released Gemini 3.1 Flash Live, its new voice and audio AI model, scoring 95.9% on the Big Bench Audio Benchmark at high thinking levels—second only to Step-Audio R1.1 Realtime at 97.0%. Response times range from 0.96 seconds at minimal thinking to 2.98 seconds at high thinking, with pricing held at $0.35 per hour of audio input and $1.40 per hour of audio output.
Mistral releases Voxtral-4B-TTS-2603, open-weights text-to-speech model for production voice agents
Mistral AI released Voxtral-4B-TTS-2603, an open-weights text-to-speech model designed for production voice agents. The 4B-parameter model supports 9 languages, 20 preset voices, achieves 70ms latency at concurrency 1 on a single NVIDIA H200, and requires only 16GB GPU memory.