Mistral Releases Voxtral TTS: 4B Parameter Text-to-Speech Model at $0.016 per 1k Characters
Mistral AI has released Voxtral TTS, a 4B parameter text-to-speech model supporting 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The model achieves 70ms latency for typical inputs and can clone voices from as little as 3 seconds of audio. API access is priced at $0.016 per 1,000 characters.
Technical Specifications
The model consists of three components:
- 3.4B parameter transformer decoder backbone (built on Ministral 3B)
- 390M parameter flow-matching acoustic transformer
- 300M parameter neural audio codec
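As a quick sanity check, the three component sizes listed above add up to roughly the headline 4B figure. The dictionary keys below are illustrative labels, not official component names.

```python
# Component sizes as reported in the article, in parameters.
components = {
    "transformer_decoder_backbone": 3.4e9,
    "flow_matching_acoustic_transformer": 390e6,
    "neural_audio_codec": 300e6,
}

total_params = sum(components.values())
print(f"Total: {total_params / 1e9:.2f}B parameters")  # Total: 4.09B parameters
```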
Voxtral TTS achieves 70ms model latency for typical inputs (10-second voice sample, 500 characters) with a real-time factor of approximately 9.7x. The model natively generates up to 2 minutes of audio, with the API handling longer generations through smart interleaving.
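To put the real-time factor in perspective, the sketch below estimates how much compute time a given amount of audio would need. It assumes the 9.7x figure means seconds of audio produced per second of compute, which the announcement does not define explicitly. The function name is hypothetical.

```python
REAL_TIME_FACTOR = 9.7      # reported figure; ASSUMED to mean audio seconds per compute second
MAX_NATIVE_AUDIO_S = 120.0  # the model natively generates up to 2 minutes of audio

def estimated_compute_seconds(audio_seconds: float, rtf: float = REAL_TIME_FACTOR) -> float:
    """Rough compute time needed to synthesize `audio_seconds` of speech."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds / rtf

# Compute time for a full 2-minute native generation under the assumption above.
print(f"{estimated_compute_seconds(MAX_NATIVE_AUDIO_S):.1f} s")  # 12.4 s
```

This is a throughput estimate only. It says nothing about the 70ms latency figure, which is reported separately.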
Language Support and Capabilities
Voxtral TTS supports 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The model can adapt to custom voices using reference samples as short as 3 seconds, capturing voice characteristics including accent, inflections, intonations, and disfluencies.
The model demonstrates zero-shot cross-lingual voice adaptation despite not being explicitly trained for it. For example, it can generate English speech using a French voice prompt, producing natural-sounding French-accented English.
Architecture Details
Voxtral TTS uses a transformer-based, autoregressive, flow-matching architecture. The in-house codec processes audio causally, encoding it into semantic VQ (8,192-entry vocabulary) and acoustic FSQ (36 dimensions, 21 levels) latent representations at a 12.5Hz frame rate. The flow-matching transformer runs 16 function evaluations per audio frame to produce the acoustic latents.
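The codec figures translate into concrete frame counts and an approximate information rate. The following sketch works through that arithmetic. It assumes each frame carries exactly one semantic token plus 36 FSQ values, which is one plausible reading of the description and is not confirmed by Mistral. The helper names are hypothetical.

```python
import math

FRAME_RATE_HZ = 12.5    # codec frame rate
SEMANTIC_VOCAB = 8192   # semantic VQ vocabulary size
ACOUSTIC_DIMS = 36      # acoustic FSQ dimensions
ACOUSTIC_LEVELS = 21    # levels per FSQ dimension
NFE_PER_FRAME = 16      # flow-matching function evaluations per frame

def frames_for(audio_seconds: float) -> int:
    """Number of codec frames needed to cover `audio_seconds` of audio."""
    return math.ceil(audio_seconds * FRAME_RATE_HZ)

def bits_per_frame() -> float:
    """Upper bound on information per frame.

    ASSUMPTION: one semantic token and 36 independent FSQ values per frame.
    """
    return math.log2(SEMANTIC_VOCAB) + ACOUSTIC_DIMS * math.log2(ACOUSTIC_LEVELS)

frames = frames_for(120.0)          # a full 2-minute native generation
print(frames)                       # 1500
print(frames * NFE_PER_FRAME)       # 24000 flow-matching evaluations in total
print(f"{bits_per_frame():.1f} bits/frame")
print(f"{bits_per_frame() * FRAME_RATE_HZ / 1000:.2f} kbit/s")
```

Under these assumptions, the low frame rate keeps the autoregressive sequence short, at 1,500 steps for two minutes of audio, while the 16 evaluations per frame account for most of the acoustic modeling work.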
Performance Benchmarks
According to Mistral AI, human evaluations by native speakers show Voxtral TTS achieves superior naturalness compared to ElevenLabs Flash v2.5 while maintaining similar time-to-first-audio. The company claims performance parity with ElevenLabs v3 in quality, with support for emotion-steering.
In zero-shot custom voice evaluations across 9 languages, Mistral conducted side-by-side preference tests with 3 annotators per language pair, measuring naturalness, accent adherence, and acoustic similarity. The company claims Voxtral TTS outperformed ElevenLabs Flash v2.5 in this multilingual custom voice setting.
Pricing and Availability
Voxtral TTS is available via API at $0.016 per 1,000 characters. The model can be tested in Mistral Studio and Le Chat. An open-weight version with reference voices is available on Hugging Face under a CC BY-NC 4.0 license.
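For budgeting, the per-character price can be turned into a simple estimator. This is a minimal sketch that assumes billing is strictly linear in character count. The announcement does not say how whitespace, rounding, or minimum charges are handled, and the function name is hypothetical.

```python
PRICE_PER_1K_CHARS_USD = 0.016  # API price reported in the article

def estimate_cost_usd(text: str) -> float:
    """Estimated synthesis cost. ASSUMES billing is linear in len(text)."""
    return len(text) / 1000 * PRICE_PER_1K_CHARS_USD

sample = "x" * 500  # same length as the 500-character "typical input" cited for latency
print(f"${estimate_cost_usd(sample):.4f}")  # $0.0080
```

At this rate, one million characters would cost about $16 under the same linear assumption.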
What This Means
Mistral's entry into text-to-speech with a compact 4B parameter model and competitive pricing positions it against established players like ElevenLabs. The 70ms latency and 3-second voice cloning capability make it viable for real-time voice agent applications. The open-weight release under a non-commercial license follows Mistral's hybrid approach of offering both commercial API access and community model weights. Cross-lingual voice adaptation without explicit training is a notable capability that could simplify speech-to-speech translation pipelines.