Mistral AI Releases Voxtral: Apache 2.0 Speech Models with 32K Token Context at $0.001/Minute
Mistral AI released Voxtral, a family of open-source speech understanding models available in 24B and 3B parameter variants under the Apache 2.0 license. The models support a context of up to 32K tokens (30 minutes of audio for transcription, 40 minutes for understanding) and are priced from $0.001 per minute via the API, which Mistral says is less than half the cost of comparable proprietary systems.
Technical Specifications
Voxtral comes in two versions:
- Voxtral Small (24B): Production-scale applications
- Voxtral Mini (3B): Local and edge deployments
Both models support 32K token context length, handling up to 30 minutes of audio for transcription or 40 minutes for understanding tasks. The API uses Voxtral Mini Transcribe, an optimized transcription variant.
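As a rough illustration of working within the duration limits above, a client could split a longer recording into segments that each stay within the 30-minute transcription limit. This is a minimal sketch: the function name `plan_chunks` and the optional overlap parameter are hypothetical and not part of any Mistral API, and the 30-minute default is simply the figure quoted in this article.

```python
def plan_chunks(total_seconds, max_seconds=30 * 60, overlap_seconds=0.0):
    """Return (start, end) offsets in seconds that cover the whole recording.

    max_seconds defaults to the 30-minute transcription limit quoted above.
    overlap_seconds is a hypothetical knob for repeating a little audio
    across chunk boundaries so words at a cut are not lost.
    """
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    if not 0 <= overlap_seconds < max_seconds:
        raise ValueError("overlap_seconds must be in [0, max_seconds)")

    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break
        # Step back by the overlap before starting the next chunk.
        start = end - overlap_seconds
    return chunks


if __name__ == "__main__":
    # A 75-minute recording needs three chunks under a 30-minute limit.
    for start, end in plan_chunks(75 * 60):
        print(f"{start / 60:.0f} min -> {end / 60:.0f} min")
```

Cutting on fixed offsets can split a word in half, so a real pipeline would more likely cut on silence or use a small overlap and reconcile the duplicated text afterwards.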
Core Capabilities
The models include built-in Q&A and summarization without requiring separate ASR and language model chains. Voxtral supports native multilingual processing with automatic language detection across English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian.
Voxtral enables function-calling directly from voice input, allowing systems to trigger backend functions or API calls based on spoken commands without intermediate parsing. The models retain the text understanding capabilities of their Mistral Small 3.1 language model backbone.
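To make the voice-to-function idea concrete, the sketch below shows how an application might route a tool call returned by a model to a backend function. It makes several assumptions: the tool call is represented as a dictionary with a `name` and a JSON-encoded `arguments` string (a common convention, not confirmed here as Voxtral's exact response format), and `set_thermostat`, `TOOLS`, and `dispatch` are hypothetical names invented for illustration.

```python
import json


def set_thermostat(room: str, celsius: float) -> str:
    """Hypothetical backend function that a spoken command might trigger."""
    return f"{room} set to {celsius:.1f} C"


# Registry of functions the application is willing to expose to the model.
TOOLS = {"set_thermostat": set_thermostat}


def dispatch(tool_call: dict) -> str:
    """Run the registered function named in an assumed tool-call payload."""
    name = tool_call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    # Arguments are assumed to arrive as a JSON-encoded string.
    args = json.loads(tool_call["arguments"])
    return TOOLS[name](**args)


if __name__ == "__main__":
    # Stand-in for what a model might emit after hearing
    # "set the kitchen to twenty-one degrees".
    call = {
        "name": "set_thermostat",
        "arguments": '{"room": "kitchen", "celsius": 21}',
    }
    print(dispatch(call))
```

Keeping an explicit registry means the model can only trigger functions the application has chosen to expose, which matters when the input is free-form speech.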
Benchmark Performance
According to Mistral AI, Voxtral outperforms Whisper large-v3 across all tested transcription tasks. The company claims Voxtral Small beats GPT-4o mini Transcribe and Gemini 2.5 Flash on all evaluated tasks, achieving state-of-the-art results on English short-form benchmarks and Mozilla Common Voice.
On the FLEURS multilingual benchmark, Mistral reports Voxtral Small surpasses Whisper on every language task, with particular strength in European languages. For audio understanding tasks, the company states Voxtral Small is competitive with GPT-4o mini and Gemini 2.5 Flash, and claims state-of-the-art performance in speech translation.
Word error rates were measured across LibriSpeech, GigaSpeech, VoxPopuli, Switchboard, CHiME-4, SPGISpeech, and Earnings-21/22 datasets for English, plus Mozilla Common Voice 15.1 and FLEURS for multilingual evaluation.
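For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model's output (substitutions, deletions, and insertions), divided by the number of reference words. The following is a minimal sketch of that standard definition; it is not Mistral's evaluation code, and published benchmarks typically apply text normalization (casing, punctuation) first, which this example omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.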
Pricing and Availability
Mistral claims Voxtral Mini Transcribe outperforms OpenAI Whisper at less than half the price, while Voxtral Small matches ElevenLabs Scribe performance at less than half the cost. API pricing starts at $0.001 per minute.
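To put the starting price in concrete terms, the arithmetic can be sketched as below. The function name `estimate_cost_usd` is hypothetical, and the calculation assumes linear per-minute billing with no rounding, minimum charge, or volume tiers, none of which this article specifies.

```python
from decimal import Decimal

# Starting API price quoted in the article, in US dollars per audio minute.
PRICE_PER_MINUTE_USD = Decimal("0.001")


def estimate_cost_usd(audio_seconds, price_per_minute=PRICE_PER_MINUTE_USD):
    """Estimate API cost, assuming simple linear per-minute billing."""
    minutes = Decimal(str(audio_seconds)) / Decimal(60)
    return minutes * price_per_minute


if __name__ == "__main__":
    one_hour = 60 * 60
    print(f"1 hour:      ${estimate_cost_usd(one_hour):.2f}")
    print(f"1,000 hours: ${estimate_cost_usd(1000 * one_hour):.2f}")
```

Under these assumptions, one hour of audio costs $0.06 and 1,000 hours cost $60.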
Both models are available for download on Hugging Face. Voxtral will be integrated into Le Chat's voice mode over the coming weeks.
Enterprise Features
Mistral offers private deployment options for production-scale inference within customer infrastructure, including multi-GPU configurations and quantized builds. The company provides domain-specific fine-tuning services for legal, medical, and customer support applications.
Mistral is developing additional features including speaker segmentation, emotion detection, word-level timestamps, and non-speech audio recognition.
What This Means
Mistral positions Voxtral as a production-grade, open-source speech model whose benchmark performance is competitive with proprietary systems, although the results cited here are the company's own. The Apache 2.0 license and $0.001/minute starting price could significantly lower barriers for developers building voice-enabled applications, particularly in regulated industries requiring on-premises deployment. The 32K token context window targets long-form audio processing, a key limitation in current open-source ASR systems.