Mistral Releases Voxtral TTS: 4B Parameter Text-to-Speech Model at $0.016 per 1k Characters

TL;DR

Mistral AI has released Voxtral TTS, a 4B parameter text-to-speech model supporting 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The model achieves 70ms latency for typical inputs, can clone voices from as little as 3 seconds of audio, and is priced at $0.016 per 1,000 characters.

June 18, 2026 · 9:07 AM · 2 min read

Mistral AI has released Voxtral TTS, a 4B parameter text-to-speech model supporting 9 languages with voice cloning capabilities from as little as 3 seconds of audio.

Technical Specifications

The model consists of three components:

3.4B parameter transformer decoder backbone (built on Ministral 3B)
390M parameter flow-matching acoustic transformer
300M parameter neural audio codec
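
As a quick sanity check on the "4B" headline figure, the three component sizes listed above can be summed. This is only arithmetic on the numbers reported in the article; the dictionary keys are descriptive labels chosen for this sketch, not official component names.

```python
# Component sizes as reported in the article, in parameters.
# The keys are descriptive labels for this sketch, not official names.
components = {
    "transformer decoder backbone": 3_400_000_000,
    "flow-matching acoustic transformer": 390_000_000,
    "neural audio codec": 300_000_000,
}

total = sum(components.values())
print(f"total: {total / 1e9:.2f}B parameters")

# Share of the total held by each component.
for name, size in components.items():
    print(f"{name}: {size / total:.1%}")
```

The sum comes to roughly 4.09B, which is consistent with the rounded 4B figure, and the decoder backbone accounts for most of it.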

Voxtral TTS achieves 70ms model latency for typical inputs (10-second voice sample, 500 characters) with a real-time factor of approximately 9.7x. The model natively generates up to 2 minutes of audio, with the API handling longer generations through smart interleaving.
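
To get a feel for what these figures imply, here is a minimal sketch that estimates how long a clip would take to generate. It assumes "real-time factor of 9.7x" means audio is produced about 9.7 times faster than it plays back; the article does not define the convention, so treat that reading, and the function name `generation_seconds`, as assumptions.

```python
# Figures quoted in the article.
REAL_TIME_FACTOR = 9.7        # assumed: seconds of audio produced per second of compute
NATIVE_MAX_AUDIO_S = 120.0    # the model natively generates up to 2 minutes


def generation_seconds(audio_seconds: float, rtf: float = REAL_TIME_FACTOR) -> float:
    """Estimate compute time for a clip, assuming a constant real-time factor."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / rtf


# Estimated time to produce a full native-length clip.
full_clip = generation_seconds(NATIVE_MAX_AUDIO_S)
print(f"{NATIVE_MAX_AUDIO_S:.0f}s of audio in about {full_clip:.1f}s")
```

Under that assumption a full 2-minute clip takes on the order of 12 seconds to generate. The 70ms figure is a separate measurement of model latency and is not derived from the real-time factor.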

Language Support and Capabilities

Voxtral TTS supports 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The model can adapt to custom voices using reference samples as short as 3 seconds, capturing voice characteristics including accent, inflections, intonations, and disfluencies.

The model demonstrates zero-shot cross-lingual voice adaptation despite not being explicitly trained for it. For example, it can generate English speech using a French voice prompt, producing natural-sounding French-accented English.

Architecture Details

Voxtral TTS uses a transformer-based, autoregressive, flow-matching architecture. The in-house codec processes audio causally and represents it with semantic VQ latents (8,192-entry vocabulary) and acoustic FSQ latents (36 dimensions, 21 levels), produced at a 12.5 Hz frame rate. The flow-matching transformer runs 16 function evaluations per audio frame to produce the acoustic latents.
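
The numbers in this section are enough for some back-of-the-envelope arithmetic. The sketch below counts frames and function evaluations for a 2-minute clip and estimates an upper bound on the codec bitrate. It assumes one semantic code per frame and that each of the 36 FSQ dimensions takes one of 21 levels independently; the article does not spell out either detail, so the bitrate is illustrative only.

```python
import math

# Figures quoted in the article.
FRAME_RATE_HZ = 12.5
NFE_PER_FRAME = 16          # flow-matching function evaluations per frame
SEMANTIC_VOCAB = 8192       # semantic VQ vocabulary size
FSQ_DIMS = 36               # acoustic FSQ dimensions
FSQ_LEVELS = 21             # assumed: levels per FSQ dimension


def frames_for(seconds: float) -> int:
    """Number of codec frames needed to cover a clip of the given length."""
    return math.ceil(seconds * FRAME_RATE_HZ)


frames = frames_for(120)                  # a native-length 2-minute clip
evaluations = frames * NFE_PER_FRAME      # total flow-matching evaluations

# Raw bits per frame if every code were equally likely (an upper bound).
semantic_bits = math.log2(SEMANTIC_VOCAB)
acoustic_bits = FSQ_DIMS * math.log2(FSQ_LEVELS)
bitrate_bps = (semantic_bits + acoustic_bits) * FRAME_RATE_HZ

print(f"frames: {frames}, evaluations: {evaluations}")
print(f"bits per frame: {semantic_bits + acoustic_bits:.1f}")
print(f"bitrate upper bound: {bitrate_bps / 1000:.2f} kbps")
```

Under these assumptions a 2-minute clip is 1,500 frames and 24,000 function evaluations, and the latent stream carries at most about 2.1 kbps.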

Performance Benchmarks

According to Mistral AI, human evaluations by native speakers show Voxtral TTS achieves superior naturalness compared to ElevenLabs Flash v2.5 while maintaining similar time-to-first-audio. The company claims performance parity with ElevenLabs v3 in quality, with support for emotion-steering.

In zero-shot custom voice evaluations across 9 languages, Mistral conducted side-by-side preference tests with 3 annotators per language pair, measuring naturalness, accent adherence, and acoustic similarity. The company claims Voxtral TTS outperformed ElevenLabs Flash v2.5 in this multilingual custom voice setting.

Pricing and Availability

Voxtral TTS is available via API at $0.016 per 1,000 characters. The model can be tested in Mistral Studio and Le Chat. An open-weight version with reference voices is available on Hugging Face under a CC BY-NC 4.0 license.
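
For budgeting, the per-character price translates into a simple estimate. The sketch below assumes billing is strictly linear in character count, with no minimum charge or rounding, and counts characters as Python code points; the article does not describe the actual billing rules, and `estimate_cost` is a hypothetical helper, not part of any Mistral SDK.

```python
from decimal import Decimal

# Price quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS = Decimal("0.016")


def estimate_cost(text: str) -> Decimal:
    """Estimated cost in USD, assuming linear per-character billing."""
    return PRICE_PER_1K_CHARS * len(text) / 1000


# A 500-character input, the "typical input" size used for the latency figure.
sample = "a" * 500
print(f"500 characters: ${estimate_cost(sample)}")

# One million characters of text.
print(f"1M characters: ${estimate_cost('a' * 1_000_000)}")
```

Under these assumptions a 500-character request costs under a cent, and a million characters costs $16.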

What This Means

Mistral's entry into text-to-speech with a compact 4B parameter model and competitive pricing positions it against established players like ElevenLabs. The 70ms latency and 3-second voice cloning capability make it viable for real-time voice agent applications. The open-weight release under a non-commercial license follows Mistral's hybrid approach of offering both commercial API access and community model weights. Cross-lingual voice adaptation without explicit training is a notable capability that could simplify speech-to-speech translation pipelines.

Source: mistral.ai
