Mistral releases Voxtral-4B-TTS-2603, open-weights text-to-speech model for production voice agents
Mistral AI released Voxtral-4B-TTS-2603, an open-weights text-to-speech model designed for production voice agents. The 4B-parameter model supports 9 languages and 20 preset voices, achieves 70ms latency at concurrency 1 on a single NVIDIA H200, and requires only 16GB of GPU memory.
Mistral AI released Voxtral-4B-TTS-2603, an open-weights text-to-speech model built for production voice agent deployment. The model is distributed under the CC BY-NC 4.0 license with BF16 weights and 20 reference voices.
Performance and Hardware Requirements
Voxtral-4B requires at least 16GB of GPU memory. The following figures were measured on a single NVIDIA H200 with vLLM v0.18.0, using a 500-character text input and a 10-second audio reference:
- Single concurrent request: 70ms latency, 0.103 real-time factor (RTF), 119.14 characters/second/GPU throughput
- 16 concurrent requests: 331ms latency, 0.237 RTF, 879.11 characters/second/GPU throughput
- 32 concurrent requests: 552ms latency, 0.302 RTF, 1,430.78 characters/second/GPU throughput
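The reported figures can be turned into scaling ratios with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the conventional definition of real-time factor (generation time divided by the duration of the audio produced), which the release summary does not spell out.

```python
# Reported figures: concurrency -> (latency_ms, rtf, chars_per_second_per_gpu)
BENCHMARKS = {
    1: (70, 0.103, 119.14),
    16: (331, 0.237, 879.11),
    32: (552, 0.302, 1430.78),
}


def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Time to synthesize a clip, ASSUMING RTF = generation time / audio duration."""
    return audio_seconds * rtf


def scaling_report(benchmarks: dict) -> dict:
    """Map concurrency -> (throughput speedup vs. concurrency 1, speedup per request)."""
    base_throughput = benchmarks[1][2]
    report = {}
    for concurrency, (_latency, _rtf, throughput) in sorted(benchmarks.items()):
        speedup = throughput / base_throughput
        report[concurrency] = (round(speedup, 2), round(speedup / concurrency, 2))
    return report


if __name__ == "__main__":
    for concurrency, (speedup, per_request) in scaling_report(BENCHMARKS).items():
        print(f"concurrency {concurrency}: {speedup}x throughput, {per_request} per request")
```

On these numbers, going from 1 to 32 concurrent requests yields roughly 12 times the throughput while latency rises from 70ms to 552ms, so batching trades per-request responsiveness for aggregate capacity.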
Language and Voice Support
The model supports 9 languages: English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi. It includes 20 preset voices with dialect diversity and delivers 24kHz audio output in multiple formats (WAV, PCM, FLAC, MP3, AAC, Opus). Voice customization is available through Mistral's AI Studio.
Technical Architecture
Voxtral-4B is fine-tuned from Mistral's Ministral-3-3B-Base-2512 model. The release includes production-grade support through vLLM-Omni (version >= 0.18.0), developed in collaboration with the vLLM team. The model supports streaming and batch inference modes.
Deployment and Licensing
The model ships with vLLM-Omni integration and includes a Docker image option for containerized deployment. Installation requires vllm >= 0.18.0 and mistral_common >= 1.10.0.
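Before deploying, it can help to confirm that the installed packages meet these minimums. The following is a minimal sketch using only the standard library; the helper names are hypothetical, the distribution names are assumed to match the package names quoted above, and the version parser is deliberately simplified (it ignores pre-release ordering, so a release candidate of 0.18.0 would count as 0.18.0).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions as stated in the release notes summarized above.
# Assumption: these strings are also the installed distribution names.
MINIMUMS = {"vllm": "0.18.0", "mistral_common": "1.10.0"}


def parse_release(v: str) -> tuple:
    """Extract leading numeric components, e.g. '0.18.0rc1' -> (0, 18, 0)."""
    parts = []
    for piece in v.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically, padding with zeros so '0.18' equals '0.18.0'."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums: dict = MINIMUMS) -> dict:
    """Return name -> True/False, or None when the package is not installed."""
    status = {}
    for name, minimum in minimums.items():
        try:
            status[name] = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            status[name] = None
    return status


if __name__ == "__main__":
    print(check_environment())
```

Comparing parsed integers rather than raw strings matters here: as plain text, "1.9.3" sorts after "1.10.0", which would wrongly pass the mistral_common check.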
The reference voices inherit CC BY-NC 4.0 licensing from their source datasets (EARS, CML-TTS, IndicVoices-R, Arabic Natural Audio). Mistral states that users must comply with applicable laws and are responsible for avoiding misuse.
Stated Use Cases
Mistral positions Voxtral-4B for customer support, financial services KYC workflows, manufacturing operations, government services, supply chain logistics, in-vehicle systems, sales and marketing, and real-time translation.
What This Means
Voxtral-4B is Mistral's entry into open-weights TTS, competing against closed commercial offerings. The 70ms latency at concurrency 1 and the 4B parameter count target production deployments with moderate hardware requirements. However, the CC BY-NC license bars commercial use of the released weights, which limits adoption in commercial SaaS applications compared with permissively licensed models. Throughput of about 1,430 characters/second/GPU at 32 concurrent requests positions the model for real-time voice agent infrastructure, though practical throughput will depend on actual workload patterns and hardware availability.
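A rough back-of-the-envelope check shows why a 16GB card is plausible for a model of this size. The sketch below assumes a nominal 4 billion parameters (the exact count is not given here) stored in BF16 at 2 bytes per parameter; the function name is illustrative, and the estimate covers weights only, not activations, caches, or runtime overhead.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in decimal gigabytes (BF16 = 2 bytes/param)."""
    return num_params * bytes_per_param / 1e9


NOMINAL_PARAMS = 4e9       # assumption: "4B" taken literally
STATED_MINIMUM_GB = 16     # minimum GPU memory stated for the model

weights_gb = weight_memory_gb(NOMINAL_PARAMS)
headroom_gb = STATED_MINIMUM_GB - weights_gb

if __name__ == "__main__":
    print(f"weights: ~{weights_gb:.1f} GB, remaining headroom: ~{headroom_gb:.1f} GB")
```

Under these assumptions the weights take about half of the stated minimum, leaving the rest for everything else the serving stack needs.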