Mistral releases Voxtral, open-weight TTS model that clones voices from 3 seconds of audio

Mistral has released Voxtral TTS, a 4-billion-parameter text-to-speech model that can clone a voice from just three seconds of reference audio and supports nine languages. The model delivers 70 ms latency for typical 10-second samples and outperformed ElevenLabs Flash v2.5 in naturalness tests. Voxtral is available via API at $0.016 per 1,000 characters and as open weights on Hugging Face.

March 26, 2026 · 7:35 PM · 2 min read

text-to-speech voice-cloning open-weights

