Mistral releases Voxtral, open-weight TTS model that clones voices from 3 seconds of audio

TL;DR

Mistral has released Voxtral TTS, a 4-billion-parameter text-to-speech model that can clone voices from just three seconds of reference audio and supports nine languages. The model delivers 70 ms latency for a typical 10-second sample and outperformed ElevenLabs Flash v2.5 in naturalness tests. Voxtral is available via API at $0.016 per 1,000 characters and as open weights on Hugging Face.

Mistral has released Voxtral TTS, its first text-to-speech model, positioning it as a compact alternative to closed proprietary systems. The model contains 4 billion parameters and supports nine languages: German, English, French, Spanish, and five others.

Key Technical Specifications

Voxtral's standout capability is voice cloning from minimal audio. The model needs just three seconds of reference audio to adapt to and replicate a new voice, and it supports emotionally expressive speech synthesis. The reported latency is 70 milliseconds in a typical configuration that generates a 10-second speech sample from 500 characters of input text.

Nine languages is a broader range than many competing TTS systems cover, though only four of them are named here.

Performance vs. Competitors

In human evaluation tests, Voxtral TTS scored higher on naturalness compared to ElevenLabs Flash v2.5 at comparable response times. However, this comparison has a timing caveat: ElevenLabs subsequently released version 3, which was not included in Mistral's evaluation. This means the benchmark reflects performance against a prior-generation ElevenLabs model rather than current-generation alternatives.

Availability and Pricing

Mistral offers three access paths for Voxtral TTS:

  • API access: $0.016 per 1,000 characters
  • Mistral Studio: Web-based testing interface
  • Open-weights version: Available on Hugging Face for local deployment and fine-tuning

The open-weights release represents a departure from Mistral's approach with some of its larger language models, giving developers the ability to run Voxtral locally without relying on the company's infrastructure.
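
To make the API price concrete, the short Python sketch below estimates what a given amount of text would cost at the quoted rate of $0.016 per 1,000 characters. The function name `estimate_cost` is hypothetical, and the sketch assumes billing is strictly linear in character count with no minimum charge or rounding, which the announcement as reported here does not confirm.

```python
from decimal import Decimal

# Price quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS = Decimal("0.016")


def estimate_cost(text: str) -> Decimal:
    """Return an estimated API cost in USD for synthesizing `text`.

    Assumption (not confirmed by the article): cost scales linearly with
    the number of characters, with no minimum charge or rounding.
    """
    return PRICE_PER_1K_CHARS * len(text) / 1000


if __name__ == "__main__":
    # 500 characters matches the input size used in the latency figure.
    print(estimate_cost("x" * 500))
    # One million characters, roughly a long audiobook script.
    print(estimate_cost("x" * 1_000_000))
```

Under these assumptions, a 500-character request costs $0.008 and one million characters cost $16.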

What This Means

Voxtral establishes Mistral as a competitor in the TTS market beyond its core language modeling business. At 4 billion parameters, the model is substantially smaller than many alternatives, which makes it accessible for resource-constrained deployments, while the open-weights availability appeals to enterprises avoiding vendor lock-in. The three-second voice cloning threshold is practically significant, reducing friction for users who need quick voice adaptation. The API pricing at $0.016 per 1,000 characters is competitive but not a market undercut; comparing it with services that bill per token requires converting to per-token equivalents based on language-specific tokenization rates. The main strategic value lies in the open-weights option, which appeals to builders wanting fine-tuning and deployment flexibility that proprietary APIs don't provide.
