model release · Mistral AI

Mistral releases Voxtral-4B-TTS-2603, open-weights text-to-speech model for production voice agents

TL;DR

Mistral AI released Voxtral-4B-TTS-2603, an open-weights text-to-speech model designed for production voice agents. The 4B-parameter model supports 9 languages, 20 preset voices, achieves 70ms latency at concurrency 1 on a single NVIDIA H200, and requires only 16GB GPU memory.


Mistral Releases Voxtral-4B-TTS-2603 Open Text-to-Speech Model

Mistral AI released Voxtral-4B-TTS-2603, an open-weights text-to-speech model built for production voice agent deployment. The model is distributed under the CC BY-NC 4.0 license with BF16 weights and 20 reference voices.

Performance and Hardware Requirements

Voxtral-4B requires a minimum of 16GB of GPU memory and runs on a single NVIDIA H200. Benchmarks were measured on vLLM v0.18.0 with 500-character text input and a 10-second audio reference:

  • Single concurrent request: 70ms latency, 0.103 real-time factor (RTF), 119.14 characters/second/GPU throughput
  • 16 concurrent requests: 331ms latency, 0.237 RTF, 879.11 characters/second/GPU throughput
  • 32 concurrent requests: 552ms latency, 0.302 RTF, 1,430.78 characters/second/GPU throughput
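The real-time factor (RTF) figures above can be read against the latency numbers: RTF is synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster-than-real-time generation. A minimal arithmetic sketch (the formulas are the standard RTF definitions; only the 0.103 reference value comes from the table above):

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: compute time spent per second of audio produced.
    RTF < 1.0 means the model generates audio faster than it plays back."""
    return synthesis_seconds / audio_seconds

def realtime_headroom(rtf_value: float) -> float:
    """Seconds of audio produced per wall-clock second of compute."""
    return 1.0 / rtf_value

# At concurrency 1 the reported RTF is 0.103, i.e. roughly 9.7 seconds
# of audio are generated for every second of compute.
print(round(realtime_headroom(0.103), 1))
```

This headroom is what shrinks as concurrency rises: at 32 concurrent requests the RTF climbs to 0.302, still about 3.3x real time per request while aggregate throughput increases.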

Language and Voice Support

The model supports 9 languages: English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi. It includes 20 preset voices with dialect diversity and delivers 24kHz audio output in multiple formats (WAV, PCM, FLAC, MP3, AAC, Opus). Voice customization is available through Mistral's AI Studio.

Technical Architecture

Voxtral-4B is fine-tuned from Mistral's Ministral-3-3B-Base-2512 model. The release includes production-grade support through vLLM-Omni (version >= 0.18.0), developed in collaboration with the vLLM team. The model supports streaming and batch inference modes.
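The release does not document the request schema, so the sketch below is a hypothetical payload builder modeled on the OpenAI-compatible `/v1/audio/speech` shape that vLLM servers commonly expose; the model name, voice id, and `stream` flag are assumptions, not a confirmed Voxtral API. The supported output formats are the ones listed in the release:

```python
import json

# Hypothetical request builder; field names are assumptions modeled on
# an OpenAI-compatible audio/speech endpoint, not a confirmed Voxtral API.
def build_tts_request(text: str, voice: str, fmt: str = "wav",
                      stream: bool = False) -> dict:
    # Output formats named in the release announcement.
    supported = {"wav", "pcm", "flac", "mp3", "aac", "opus"}
    if fmt not in supported:
        raise ValueError(f"unsupported output format: {fmt}")
    return {
        "model": "voxtral-4b-tts-2603",
        "input": text,
        "voice": voice,
        "response_format": fmt,
        "stream": stream,  # streaming vs. batch inference mode
    }

payload = build_tts_request("Hello from a voice agent.",
                            voice="example-voice", stream=True)
print(json.dumps(payload, indent=2))
```

The `stream` toggle mirrors the streaming/batch distinction in the release; for a real deployment the served route and field names should be checked against the vLLM-Omni documentation.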

Deployment and Licensing

The model ships with vLLM-Omni integration and includes a Docker image option for containerized deployment. Installation requires vllm >= 0.18.0 and mistral_common >= 1.10.0.
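The stated version floors translate into a pip install like the following; the Docker image name is not given in the announcement, so only the Python-package route is shown here:

```shell
# Minimum versions stated for Voxtral-4B deployment
pip install "vllm>=0.18.0" "mistral_common>=1.10.0"
```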

The reference voices inherit CC BY-NC 4.0 licensing from their source datasets (EARS, CML-TTS, IndicVoices-R, Arabic Natural Audio). Mistral specifies that users must comply with applicable laws and are responsible for avoiding misuse.

Stated Use Cases

Mistral positions Voxtral-4B for customer support, financial services KYC workflows, manufacturing operations, government services, supply chain logistics, in-vehicle systems, sales and marketing, and real-time translation.

What This Means

Voxtral-4B represents Mistral's entry into the open-weights TTS space, competing against closed commercial solutions. The sub-100ms latency and 4B parameter count target production deployments with moderate hardware requirements. The CC BY-NC 4.0 license bars commercial use without a separate agreement with Mistral, which limits adoption in commercial SaaS products compared with permissively licensed alternatives. The model's performance at 32 concurrent requests (roughly 1,430 characters/second throughput) positions it for real-time voice agent infrastructure, though practical throughput will depend on actual workload patterns and hardware availability.

Related Articles

model release

Mistral releases Voxtral, open-weight TTS model that clones voices from 3 seconds of audio

Mistral has released Voxtral TTS, a 4-billion-parameter text-to-speech model that can clone voices from just three seconds of reference audio across nine languages. The model delivers 70ms latency for typical 10-second samples and outperformed ElevenLabs Flash v2.5 in naturalness tests. Voxtral is available via API at $0.016 per 1,000 characters and as open-weights on Hugging Face.

model release

Mistral releases Voxtral TTS, open-source speech model for enterprise voice agents

Mistral AI released Voxtral TTS, an open-source text-to-speech model designed for enterprise voice agents and edge devices. The model supports nine languages, adapts custom voices from samples under five seconds, and achieves 90ms time-to-first-audio latency with a 6x real-time factor.

model release

NVIDIA releases gpt-oss-puzzle-88B, 88B-parameter reasoning model with 1.63× throughput gains

NVIDIA released gpt-oss-puzzle-88B on March 26, 2026, an 88-billion parameter mixture-of-experts model optimized for inference efficiency on H100 hardware. Built using the Puzzle post-training neural architecture search framework, the model achieves a 1.63× throughput improvement in long-context (64K/64K) scenarios and up to a 2.82× improvement on single H100 GPUs compared to its parent gpt-oss-120B, while matching or exceeding accuracy across reasoning effort levels.

model release

Cohere releases 2B open-source speech model with 5.42% word error rate

Cohere has released Transcribe, a 2 billion parameter open-source automatic speech recognition model that the company claims tops the Hugging Face Open ASR Leaderboard with a 5.42% word error rate. The model supports 14 languages and is available under Apache 2.0 license, outperforming OpenAI's Whisper Large v3 and competing models on both accuracy and throughput metrics.
