model release

Cohere releases 2B open-source speech model with 5.42% word error rate

TL;DR

Cohere has released Transcribe, a 2-billion-parameter open-source automatic speech recognition model that the company claims tops the Hugging Face Open ASR Leaderboard with a 5.42% word error rate. The model supports 14 languages and is available under the Apache 2.0 license, and Cohere says it outperforms OpenAI's Whisper Large v3 and competing models on both accuracy and throughput.


Cohere has released Transcribe, a 2-billion-parameter open-source automatic speech recognition (ASR) model. According to the company, it achieves a 5.42% average word error rate on the Hugging Face Open ASR Leaderboard, claiming the top position ahead of OpenAI's Whisper Large v3, ElevenLabs Scribe v2, and Qwen3-ASR-1.7B.

Model specifications and performance

Transcribe supports 14 languages, including English, German, French, and Japanese. Cohere also claims the model delivers the best throughput among similarly sized competitors, a critical metric for production deployments where both latency and accuracy matter.
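For context, the 5.42% figure is a word error rate (WER): the word-level edit distance (substitutions, deletions, and insertions) between the model's transcript and a reference transcript, divided by the number of reference words. A minimal sketch of how the metric is computed, using illustrative sentences rather than leaderboard data:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") out of six reference words:
print(round(wer("the cat sat on the mat", "the cat sat on a mat"), 3))  # → 0.167
```

Leaderboard WER is this ratio averaged across benchmark datasets, so a 5.42% score means roughly one word-level error per 18 reference words under the leaderboard's test conditions.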

The model is available for download under the Apache 2.0 open-source license from Hugging Face, making it freely usable for commercial and non-commercial applications. It can also be accessed through Cohere's API and the company's Model Vault platform for users preferring cloud-based inference.

Deployment and integration plans

Cohere plans to integrate Transcribe into its North AI agent platform. The release positions Transcribe as a viable alternative to proprietary speech models for developers and organizations seeking open-source ASR capabilities.

The 2B parameter count represents a middle ground: larger than many mobile-optimized models but smaller than the largest research models, suggesting practical hardware requirements for deployment.

What this means

Cohere's Transcribe release adds competition to a speech recognition market where Whisper has dominated open-source discussions. The claimed 5.42% WER is notable, and the Apache 2.0 license removes friction for commercial applications. However, independent verification of leaderboard results is essential: published benchmarks sometimes reflect narrow test conditions rather than real-world performance across diverse audio conditions, accents, and domains. The model's multilingual support and throughput claims warrant direct comparison testing before large-scale adoption decisions.

