Mistral releases Voxtral TTS, open-source speech model for enterprise voice agents
Mistral AI released Voxtral TTS, an open-source text-to-speech model designed for enterprise voice agents and edge devices. The model supports nine languages, adapts custom voices from samples under five seconds, and achieves 90ms time-to-first-audio latency with a 6x real-time factor.
Mistral AI released Voxtral TTS on Thursday, an open-source text-to-speech model targeting enterprise voice applications and edge deployment. The model directly competes with ElevenLabs, Deepgram, and OpenAI's voice offerings.
Model Specifications
Voxtral TTS supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The model is based on Ministral 3B and designed for real-time performance, with a time-to-first-audio (TTFA) of 90 milliseconds for a 10-second, 500-character sample. Its real-time factor (RTF) is 6x, meaning it can render a 10-second audio clip in approximately 1.7 seconds (10 divided by 6).
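The render-time figure follows from simple arithmetic. The sketch below is illustrative only and assumes RTF here means audio duration divided by processing time, the convention a "6x" figure implies; some speech literature uses the inverse ratio, where lower is better. The function name is hypothetical and is not part of any Mistral API.

```python
def render_time_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Estimate synthesis time for a clip.

    Assumes real_time_factor = audio duration / processing time,
    so a factor of 6 means audio is produced six times faster than playback.
    """
    if real_time_factor <= 0:
        raise ValueError("real_time_factor must be positive")
    return audio_seconds / real_time_factor


clip_seconds = 10.0  # clip length quoted in the article
rtf = 6.0            # real-time factor quoted in the article

print(f"{render_time_seconds(clip_seconds, rtf):.2f} s")  # prints "1.67 s"
```

This estimate covers rendering the whole clip. The 90 ms TTFA is a separate measure of how soon the first audio arrives.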
The model adapts to custom voices from samples shorter than five seconds while preserving accent, inflection, intonation, and speech irregularities. According to Mistral, it can switch between languages without losing voice characteristics, which is useful for dubbing and real-time translation applications.
Positioning and Capabilities
Pierre Stock, VP of science operations at Mistral AI, told TechCrunch that the company built "a small-sized speech model that can fit on a smartwatch, a smartphone, a laptop, or other edge devices" with "a cost that is a fraction of anything else on the market." The company emphasizes human-sounding output and real-time performance as core differentiators.
Mistral positions the open-source nature and customization flexibility as competitive advantages, allowing enterprises to tune models for specific use cases rather than relying on proprietary, managed solutions.
Strategic Context
Voxtral TTS complements Mistral's earlier 2026 releases of transcription models for batch and real-time processing. Stock indicated the company plans "an end-to-end platform that can handle multimodal streams of input, including audio, text, and image and output as well," suggesting a broader vision for agentic systems that process multiple modalities.
Pricing details were not disclosed, and the terms for open-source use and commercial deployment remain unspecified.
What this means
Mistral is building a complete voice AI stack to compete with specialized speech companies and large language model providers offering voice capabilities. The open-source release strategy trades proprietary advantage for developer adoption and enterprise customization flexibility. The 90ms latency and edge-device focus suggest targeting real-time conversational agents rather than pre-rendered content, positioning against both traditional TTS vendors and API-based competitors.