model release · Mistral AI

Mistral releases Voxtral-4B-TTS-2603, open-weights text-to-speech model for production voice agents

TL;DR

Mistral AI released Voxtral-4B-TTS-2603, an open-weights text-to-speech model designed for production voice agents. The 4B-parameter model supports 9 languages, 20 preset voices, achieves 70ms latency at concurrency 1 on a single NVIDIA H200, and requires only 16GB GPU memory.


Mistral AI released Voxtral-4B-TTS-2603, an open-weights text-to-speech model built for production voice-agent deployment. The model is distributed under a CC BY-NC 4.0 license with BF16 weights and 20 reference voices.

Performance and Hardware Requirements

Voxtral-4B requires a minimum of 16GB GPU memory and runs on a single NVIDIA H200. Measured on vLLM v0.18.0 with 500-character text input and 10-second audio reference:

  • Single concurrent request: 70ms latency, 0.103 real-time factor (RTF), 119.14 characters/second/GPU throughput
  • 16 concurrent requests: 331ms latency, 0.237 RTF, 879.11 characters/second/GPU throughput
  • 32 concurrent requests: 552ms latency, 0.302 RTF, 1,430.78 characters/second/GPU throughput
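The RTF and throughput figures above are related by simple arithmetic. A minimal sketch, assuming RTF is defined as generation time divided by output audio duration (the usual TTS convention; the model card does not state the definition explicitly):

```python
# Benchmark figures from the list above (500-character input, single H200).
benchmarks = {
    1:  {"rtf": 0.103, "chars_per_sec": 119.14},
    16: {"rtf": 0.237, "chars_per_sec": 879.11},
    32: {"rtf": 0.302, "chars_per_sec": 1430.78},
}
INPUT_CHARS = 500

for conc, b in benchmarks.items():
    # Per-request generation time implied by the aggregate throughput.
    gen_time = INPUT_CHARS / (b["chars_per_sec"] / conc)
    # RTF = generation_time / audio_duration  =>  implied audio length.
    audio_sec = gen_time / b["rtf"]
    print(f"concurrency {conc:2d}: ~{gen_time:.1f}s per request, "
          f"~{audio_sec:.0f}s of audio")
```

The implied output duration comes out near 37-41 seconds of audio for a 500-character input at every concurrency level, which suggests the three rows are internally consistent.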

Language and Voice Support

The model supports 9 languages: English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi. It includes 20 preset voices with dialect diversity and delivers 24kHz audio output in multiple formats (WAV, PCM, FLAC, MP3, AAC, Opus). Voice customization is available through Mistral's AI Studio.
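The 24kHz output rate makes raw audio bandwidth easy to estimate. A quick sketch, assuming 16-bit mono PCM (bit depth and channel count are assumptions, not stated in the release):

```python
SAMPLE_RATE = 24_000    # Hz, from the model card
BYTES_PER_SAMPLE = 2    # 16-bit PCM (assumption)
CHANNELS = 1            # mono (assumption)

bytes_per_sec = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
print(f"raw PCM: {bytes_per_sec / 1000:.0f} kB/s, "
      f"{bytes_per_sec * 8 / 1000:.0f} kbit/s")  # prints "raw PCM: 48 kB/s, 384 kbit/s"
```

Compressed formats such as Opus or AAC cut this substantially, which is presumably why the model exposes multiple output formats for bandwidth-sensitive deployments.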

Technical Architecture

Voxtral-4B is fine-tuned from Mistral's Ministral-3-3B-Base-2512 model. The release includes production-grade support through vLLM-Omni (version >= 0.18.0), developed in collaboration with the vLLM team. The model supports streaming and batch inference modes.

Deployment and Licensing

The model ships with vLLM-Omni integration and includes a Docker image option for containerized deployment. Installation requires vllm >= 0.18.0 and mistral_common >= 1.10.0.
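Given the version requirements above, a minimal install sketch (package names and minimum versions are as stated in the release; any extra index URL or Docker tag would depend on Mistral's published instructions):

```shell
pip install "vllm>=0.18.0" "mistral_common>=1.10.0"
```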

The reference voices inherit CC BY-NC 4.0 licensing from their source datasets (EARS, CML-TTS, IndicVoices-R, Arabic Natural Audio). Mistral specifies that users must comply with applicable laws and are responsible for avoiding misuse.

Stated Use Cases

Mistral positions Voxtral-4B for customer support, financial services KYC workflows, manufacturing operations, government services, supply chain logistics, in-vehicle systems, sales and marketing, and real-time translation.

What This Means

Voxtral-4B marks Mistral's entry into the open-weights TTS space, competing against closed commercial offerings. The sub-100ms single-request latency and 4B parameter count target production deployments with moderate hardware requirements. The CC BY-NC 4.0 license restricts commercial use, so adoption in commercial SaaS products will be more limited than with permissively licensed alternatives. Throughput at 32 concurrent requests (roughly 1,430 characters/second) positions the model for real-time voice-agent infrastructure, though practical throughput will depend on actual workload patterns and hardware availability.
