model release

Mistral releases Voxtral TTS, an open-weight text-to-speech model that clones voices from 3 seconds of audio

TL;DR

Mistral has released Voxtral TTS, a 4-billion-parameter text-to-speech model that can clone voices from just three seconds of reference audio across nine languages. The model delivers 70ms latency for typical 10-second samples and outperformed ElevenLabs Flash v2.5 in naturalness tests. Voxtral is available via API at $0.016 per 1,000 characters and as open-weights on Hugging Face.


Mistral Releases Voxtral: Open-Weight TTS Model with Voice Cloning from 3-Second Samples

Mistral has released Voxtral TTS, its first text-to-speech model, positioning it as a compact alternative to closed proprietary systems. The model contains 4 billion parameters and supports nine languages, of which Mistral names four: German, English, French, and Spanish.

Key Technical Specifications

Voxtral's standout capability is voice cloning from minimal audio: the model needs just three seconds of reference audio to clone a new voice, and it supports emotionally expressive speech synthesis. Latency benchmarks show 70 milliseconds for a typical configuration, in which 500 characters of input text produce roughly 10 seconds of speech.
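Those figures can be sanity-checked with a quick back-of-envelope calculation. A minimal sketch, assuming the 70 ms number is the latency for the full 500-character, roughly 10-second configuration described above:

```python
# Figures from the published benchmark (assumptions noted in the text):
# 500 characters of input -> ~10 seconds of synthesized speech,
# with 70 ms latency for that configuration.
chars_per_request = 500
audio_seconds = 10.0
latency_s = 0.070

chars_per_audio_second = chars_per_request / audio_seconds  # chars per second of speech
latency_fraction = latency_s / audio_seconds                # latency vs. clip length

print(f"{chars_per_audio_second:.0f} characters per second of speech")
print(f"latency is {latency_fraction:.1%} of the clip duration")
# → 50 characters per second of speech
# → latency is 0.7% of the clip duration
```

In other words, the quoted latency is well under one percent of the clip it produces, which is what makes the model plausible for interactive use.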

The model operates across a broader linguistic range than many competing TTS systems, though Mistral has not specified the complete language list beyond the four named examples.

Performance vs. Competitors

In human evaluation tests, Voxtral TTS scored higher on naturalness compared to ElevenLabs Flash v2.5 at comparable response times. However, this comparison has a timing caveat: ElevenLabs subsequently released version 3, which was not included in Mistral's evaluation. This means the benchmark reflects performance against a prior-generation ElevenLabs model rather than current-generation alternatives.

Availability and Pricing

Mistral offers three access paths for Voxtral TTS:

  • API access: $0.016 per 1,000 characters
  • Mistral Studio: Web-based testing interface
  • Open-weights version: Available on Hugging Face for local deployment and fine-tuning
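At the listed API rate, per-request cost is easy to estimate. A minimal sketch (the rate is Mistral's published price; linear per-character billing with no minimum charge is an assumption):

```python
PRICE_PER_1K_CHARS = 0.016  # USD, Mistral's published Voxtral TTS API rate

def tts_cost(text: str) -> float:
    """Estimated API cost in USD for synthesizing `text`.
    Assumes linear per-character billing with no minimum charge."""
    return len(text) / 1000 * PRICE_PER_1K_CHARS

# A 500-character request (the benchmark configuration, ~10 s of speech):
print(f"${tts_cost('x' * 500):.4f}")      # → $0.0080
# One hour of speech at ~50 characters/second -> 180,000 characters:
print(f"${tts_cost('x' * 180_000):.2f}")  # → $2.88
```

So a full hour of synthesized speech costs on the order of a few dollars via the API, before considering the free local-deployment path.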

The open-weights release represents a departure from Mistral's approach with some of its larger language models, giving developers the ability to run Voxtral locally without relying on the company's infrastructure.

What This Means

Voxtral establishes Mistral as a competitor in the TTS market beyond its core language modeling business. The 4-billion-parameter size, substantially smaller than many alternatives, makes the model viable for resource-constrained deployments, while the open-weights availability appeals to enterprises wary of vendor lock-in. The three-second voice cloning threshold is practically significant, removing most of the friction for users who need quick voice adaptation.

The API pricing of $0.016 per 1,000 characters is competitive but not a market undercut; a fair comparison requires converting to per-token equivalents based on language-specific tokenization rates. The main strategic value lies in the open-weights release, which gives builders the fine-tuning and deployment flexibility that proprietary APIs don't provide.
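The per-token conversion can be sketched as follows. The characters-per-token ratios below are illustrative assumptions, not measured tokenizer statistics; real ratios vary by language and tokenizer:

```python
PRICE_PER_1K_CHARS = 0.016  # USD, Mistral's published Voxtral TTS API rate

def price_per_million_tokens(chars_per_token: float) -> float:
    """Convert a per-character TTS price into a per-1M-token equivalent,
    given an assumed average characters-per-token ratio."""
    chars_per_million_tokens = 1_000_000 * chars_per_token
    return chars_per_million_tokens / 1000 * PRICE_PER_1K_CHARS

# Illustrative ratios only (assumed, not measured):
for lang, ratio in [("English", 4.0), ("German", 5.0), ("French", 4.5)]:
    print(f"{lang}: ${price_per_million_tokens(ratio):.2f} per 1M tokens")
# → English: $64.00 per 1M tokens
# → German: $80.00 per 1M tokens
# → French: $72.00 per 1M tokens
```

The spread illustrates the point in the text: the same per-character price translates into materially different per-token prices across languages, so head-to-head comparisons with token-priced competitors depend on which language you benchmark.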

Related Articles

model release

Google DeepMind Releases Gemma 4 E4B with Multi-Token Prediction for 2x Faster Inference

Google DeepMind released the Gemma 4 E4B assistant model using Multi-Token Prediction (MTP) architecture that accelerates inference by up to 2x through speculative decoding. The 4.5B effective parameter model supports 128K context windows and handles text, image, and audio input with pricing not yet disclosed.

model release

Google DeepMind Releases Gemma 4 26B A4B Assistant Model for 2x Faster Inference via Multi-Token Prediction

Google DeepMind has released a Multi-Token Prediction assistant model for Gemma 4 26B A4B that achieves up to 2x decoding speedup through speculative decoding. The model uses 3.8B active parameters from a 25.2B total parameter MoE architecture with 128 experts and a 256K token context window.

model release

Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters

Google DeepMind has released Gemma 4, a family of open-weight multimodal models including a 31B dense model with 256K context window and four size variants ranging from 2.3B to 30.7B effective parameters. The release includes Multi-Token Prediction (MTP) draft models that achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.

model release

Supertone releases Supertonic 3: 99M-parameter on-device TTS model supporting 31 languages

Supertone has released Supertonic 3, a 99M-parameter text-to-speech model that runs entirely on-device using ONNX Runtime. The model expands language support from 5 to 31 languages compared to Supertonic 2, requires no GPU, and claims competitive accuracy against models 7-20x larger.
