Mistral releases Voxtral, open-weight TTS model that clones voices from 3 seconds of audio
Mistral has released Voxtral TTS, a 4-billion-parameter text-to-speech model that supports nine languages and can clone voices from just three seconds of reference audio. The model delivers 70 ms latency for typical 10-second samples and outperformed ElevenLabs Flash v2.5 in naturalness tests. Voxtral is available via API at $0.016 per 1,000 characters and as open weights on Hugging Face.
Mistral has released Voxtral TTS, its first text-to-speech model, positioning it as a compact alternative to closed proprietary systems. The model contains 4 billion parameters and supports nine languages: German, English, French, Spanish, and five others.
Key Technical Specifications
Voxtral's standout capability is voice cloning from minimal audio. The model needs just three seconds of reference audio to adapt to and replicate a new voice, and it supports emotionally expressive speech synthesis. The reported latency is 70 milliseconds in a typical configuration: a 10-second speech sample generated from 500 characters of input text.
With nine supported languages, the model covers a broader linguistic range than many competing TTS systems, though Mistral has not specified the complete language list beyond the four named examples.
Performance vs. Competitors
In human evaluation tests, Voxtral TTS scored higher on naturalness than ElevenLabs Flash v2.5 at comparable response times. This comparison comes with a timing caveat: ElevenLabs has since released version 3, which was not included in Mistral's evaluation. The benchmark therefore reflects performance against a prior-generation ElevenLabs model rather than the current generation.
Availability and Pricing
Mistral offers three access paths for Voxtral TTS:
- API access: $0.016 per 1,000 characters
- Mistral Studio: Web-based testing interface
- Open-weights version: Available on Hugging Face for local deployment and fine-tuning
The open-weights release represents a departure from Mistral's approach with some of its larger language models, giving developers the ability to run Voxtral locally without relying on the company's infrastructure.
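For budgeting, the API price quoted above can be turned into a rough cost estimate. The sketch below is illustrative only: the function name `estimate_cost_usd` is hypothetical, and it assumes that billing is strictly proportional to the number of input characters, with no minimum charge, rounding, or volume discount. The article confirms none of those details.

```python
# Price quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS_USD = 0.016


def estimate_cost_usd(num_chars: int) -> float:
    """Rough API cost for synthesizing `num_chars` characters of text.

    Assumption: cost scales linearly with character count. Real billing
    may round, apply minimums, or count characters differently.
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return PRICE_PER_1K_CHARS_USD * num_chars / 1000


if __name__ == "__main__":
    # 500 characters matches the input size used in the latency benchmark.
    for chars in (500, 100_000, 1_000_000):
        print(f"{chars:>9,} characters -> ${estimate_cost_usd(chars):.4f}")
```

Under these assumptions, the 500-character input from the latency benchmark costs under one cent, and one million characters costs about $16.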
What This Means
Voxtral establishes Mistral as a competitor in the TTS market beyond its core language modeling business. At 4 billion parameters, the model is substantially smaller than many alternatives, which makes it practical for resource-constrained deployments, while the open-weights availability appeals to enterprises avoiding vendor lock-in. The three-second voice cloning threshold is practically significant because it reduces friction for users who need quick voice adaptation. The API pricing of $0.016 per 1,000 characters is competitive but not a market undercut; comparing it with services that bill per token requires converting to per-token equivalents based on language-specific tokenization rates. The main strategic value lies in the open-weights option, which appeals to builders wanting fine-tuning and deployment flexibility that proprietary APIs don't provide.
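To put the 4-billion-parameter figure in concrete terms, the memory needed just to hold the weights can be estimated from the parameter count and the numeric precision. This is a back-of-the-envelope sketch, not a statement of Voxtral's actual requirements: the article does not say which precision the released checkpoint uses, and the estimate ignores activations, caches, and any audio codec components. The helper name `weight_memory_gib` is hypothetical.

```python
# Parameter count reported in the article.
NUM_PARAMS = 4_000_000_000

# Standard storage sizes per parameter for common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory in GiB needed to store the weights alone at a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        gib = weight_memory_gib(NUM_PARAMS, dtype)
        print(f"{dtype:>8}: {gib:.2f} GiB for weights only")
```

At 16-bit precision the weights alone come to roughly 7.5 GiB, which is why a model of this size is plausible on a single GPU; actual runtime memory will be higher.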