Cohere releases 2B open-source speech model with 5.42% word error rate
Cohere has released Transcribe, a 2 billion parameter open-source automatic speech recognition model that the company claims tops the Hugging Face Open ASR Leaderboard with a 5.42% word error rate. The model supports 14 languages and is available under the Apache 2.0 license; according to Cohere, it outperforms OpenAI's Whisper Large v3 and competing models on both accuracy and throughput.
Cohere has released Transcribe, a 2 billion parameter open-source automatic speech recognition (ASR) model. According to the company, it achieves a 5.42% average word error rate on the Hugging Face Open ASR Leaderboard, claiming the top position ahead of OpenAI's Whisper Large v3, ElevenLabs Scribe v2, and Qwen3-ASR-1.7B.
Model specifications and performance
Transcribe supports 14 languages, including English, German, French, and Japanese. Cohere also claims the model delivers the best throughput among similarly sized competitors, a critical metric for production deployments where both latency and accuracy matter.
The model is available for download from Hugging Face under the Apache 2.0 open-source license, making it freely usable for commercial and non-commercial applications. It can also be accessed through Cohere's API and the company's Model Vault platform for users who prefer cloud-based inference.
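For readers who want to try the Hugging Face route, the sketch below uses the standard transformers automatic-speech-recognition pipeline. The repository ID shown is a placeholder, and compatibility with the transformers pipeline is an assumption rather than something confirmed in the release; check Cohere's Hugging Face page for the actual model name and recommended loading code.

```python
# Minimal sketch: transcribing a local audio file with the Hugging Face
# transformers ASR pipeline. The repo ID below is hypothetical, and
# pipeline compatibility is assumed, not confirmed by the release.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="CohereLabs/transcribe-2b",  # hypothetical repository ID
    device_map="auto",                 # place weights on GPU if available
)

result = asr("meeting_recording.wav")
print(result["text"])
```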
Deployment and integration plans
Cohere plans to integrate Transcribe into its North AI agent platform. The release positions Transcribe as a viable alternative to proprietary speech models for developers and organizations seeking open-source ASR capabilities.
The 2B parameter count represents a middle ground: larger than many mobile-optimized models but smaller than the largest research-scale ASR models, suggesting practical hardware requirements for deployment.
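As a rough back-of-the-envelope check of those hardware requirements, weight memory scales with parameter count times bytes per parameter. The estimate below covers weights only and ignores activations, audio buffers, and framework overhead, so treat it as a lower bound.

```python
# Back-of-the-envelope weight memory for a 2B-parameter model.
# Weights only; activations and framework overhead are not included.
PARAMS = 2_000_000_000

for dtype, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{dtype:>9}: ~{gib:.1f} GiB of weights")

# fp32: ~7.5 GiB, fp16/bf16: ~3.7 GiB, int8: ~1.9 GiB
```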
What this means
Cohere's Transcribe release adds competition to a speech recognition market where Whisper has dominated open-source discussions. The Apache 2.0 license removes licensing friction for commercial applications, and the claimed 5.42% WER would put the model ahead of established open-source options. However, independent verification of leaderboard results is essential: published benchmarks sometimes reflect narrow test conditions rather than real-world performance across diverse audio conditions, accents, and domains. The model's multilingual support and throughput claims warrant direct comparison testing before large-scale adoption decisions.
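One direct way to run that comparison testing is to score model output against your own reference transcripts. The sketch below uses the open-source jiwer package; the transcript pairs are placeholders, and in practice the hypotheses would come from running the model over your own domain-specific audio, with consistent text normalization (casing, punctuation, numerals) applied to both sides first.

```python
# Minimal sketch of an in-house WER check using jiwer (pip install jiwer).
# The strings below are placeholder transcripts for illustration only.
import jiwer

references = [
    "the quarterly results were better than expected",
    "please schedule the follow up meeting for tuesday",
]
hypotheses = [
    "the quarterly results were better than expected",
    "please schedule the followup meeting for tuesday",
]

# Word error rate pooled across all reference/hypothesis pairs.
error_rate = jiwer.wer(references, hypotheses)
print(f"WER: {error_rate:.2%}")
```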