model release · Cohere

Cohere releases 2B open-source speech model with 5.42% word error rate

TL;DR

Cohere has released Transcribe, a 2 billion parameter open-source automatic speech recognition model that the company claims tops the Hugging Face Open ASR Leaderboard with a 5.42% word error rate. The model supports 14 languages and is available under the Apache 2.0 license, and Cohere says it outperforms OpenAI's Whisper Large v3 and competing models on both accuracy and throughput.

1 min read

Cohere has released Transcribe, a 2 billion parameter open-source automatic speech recognition (ASR) model. According to the company, it achieves a 5.42% average word error rate on the Hugging Face Open ASR Leaderboard, claiming the top position ahead of OpenAI's Whisper Large v3, ElevenLabs Scribe v2, and Qwen3-ASR-1.7B.
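For readers unfamiliar with the headline metric: word error rate is the word-level Levenshtein (edit) distance between a model's transcript and a reference transcript, divided by the number of reference words. A minimal, self-contained sketch of the standard computation (not Cohere's or the leaderboard's code, which also applies text normalization before scoring):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for edit distance between word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> 25% WER
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

A leaderboard score like 5.42% is this ratio averaged across the benchmark's test sets, so roughly one word in twenty is inserted, deleted, or substituted relative to the reference.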

Model specifications and performance

Transcribe supports 14 languages, including English, German, French, and Japanese. Cohere also claims the model delivers the best throughput among similarly sized competitors, a critical metric for production deployments where both latency and accuracy matter.

The model is available for download from Hugging Face under the Apache 2.0 open-source license, making it freely usable for commercial and non-commercial applications. It can also be accessed through Cohere's API and the company's Model Vault platform for users who prefer cloud-based inference.

Deployment and integration plans

Cohere plans to integrate Transcribe into its North AI agent platform. The release positions Transcribe as a viable alternative to proprietary speech models for developers and organizations seeking open-source ASR capabilities.

The 2B parameter size represents a middle ground: larger than many mobile-optimized models but far smaller than the largest research models, suggesting practical hardware requirements for deployment.

What this means

Cohere's Transcribe release adds competition to the speech recognition market where Whisper has dominated open-source discussions. The claimed 5.42% WER and Apache 2.0 licensing remove licensing friction for commercial applications. However, independent verification of leaderboard results is essential—published benchmarks sometimes reflect narrow test conditions rather than real-world performance across diverse audio conditions, accents, and domains. The model's multilingual support and throughput claims warrant direct comparison testing before large-scale adoption decisions.

Related Articles

model release

Tencent Releases Hy3 Preview: Mixture-of-Experts Model with 262K Context and Configurable Reasoning

Tencent has released Hy3 preview, a Mixture-of-Experts model with a 262,144 token context window priced at $0.066 per million input tokens and $0.26 per million output tokens. The model features three configurable reasoning modes—disabled, low, and high—designed for agentic workflows and production environments.

model release

Google releases Gemini 3.1 Flash Lite with 1M context at $0.25 per million input tokens

Google has released Gemini 3.1 Flash Lite, a high-efficiency multimodal model with a 1,048,576 token context window priced at $0.25 per million input tokens and $1.50 per million output tokens. The model supports text, image, video, audio, and PDF inputs with four thinking levels for cost-performance optimization.

model release

IBM Releases Granite Embedding 311M R2 With 32K Context, 200+ Language Support

IBM released Granite Embedding 311M Multilingual R2, a 311-million parameter dense embedding model with 32,768-token context length and support for 200+ languages. The model scores 64.0 on Multilingual MTEB Retrieval (18 tasks), an 11.8-point improvement over its predecessor, and ships with ONNX and OpenVINO models for production deployment.

model release

IBM Releases Granite Speech 4.1 2B: 2-Billion-Parameter Multilingual Speech Model with Non-Autoregressive Variant

IBM has released Granite Speech 4.1 2B, a 2-billion-parameter speech-language model trained on 174,000 hours of audio for automatic speech recognition and translation across English, French, German, Spanish, Portuguese, and Japanese. The model introduces a dual-head CTC encoder and includes variants for speaker attribution and a novel non-autoregressive architecture for higher throughput.
