Mistral releases Voxtral TTS, open-source speech model for enterprise voice agents
Mistral AI released Voxtral TTS, an open-source text-to-speech model designed for enterprise voice agents and edge devices. The model supports nine languages, adapts to custom voices from samples shorter than five seconds, and achieves a 90ms time-to-first-audio latency with a 6x real-time factor.
Mistral AI released Voxtral TTS on Thursday, an open-source text-to-speech model targeting enterprise voice applications and edge deployment. The model directly competes with ElevenLabs, Deepgram, and OpenAI's voice offerings.
Model Specifications
Voxtral TTS supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The model is based on Ministral 3B and designed for real-time performance with a time-to-first-audio (TTFA) of 90 milliseconds for a 10-second, 500-character sample. Its real-time factor (RTF) is 6x, meaning it can render a 10-second audio clip in approximately 1.6 seconds.
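The relationship between these two figures is worth spelling out: TTFA governs when the first audio arrives, while RTF governs how long the full clip takes to render. A minimal sketch of that arithmetic (illustrative only — these helper functions are not part of any Mistral API):

```python
def render_time_seconds(audio_seconds: float, rtf: float) -> float:
    """Time to synthesize a clip given a real-time factor (RTF).

    An RTF above 1 means faster than real time: at 6x, audio is
    rendered in one sixth of its playback duration.
    """
    return audio_seconds / rtf


def streaming_latency_seconds(ttfa_ms: float) -> float:
    """In a streaming setup, the listener hears audio after the TTFA,
    not after the full render completes."""
    return ttfa_ms / 1000.0


# A 10-second clip at 6x RTF renders in ~1.67 seconds...
print(round(render_time_seconds(10.0, 6.0), 2))
# ...but with streaming, playback can begin after just 0.09 seconds.
print(streaming_latency_seconds(90.0))
```

This is why the two metrics matter for different use cases: RTF bounds batch throughput (dubbing, pre-rendered content), while TTFA bounds perceived responsiveness in conversational agents.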
The model adapts to custom voices from samples shorter than five seconds while preserving accent, inflection, intonation, and speech irregularities. According to Mistral, it can switch between languages without losing voice characteristics—useful for dubbing and real-time translation applications.
Positioning and Capabilities
Pierre Stock, VP of science operations at Mistral AI, told TechCrunch that the company built "a small-sized speech model that can fit on a smartwatch, a smartphone, a laptop, or other edge devices" with "a cost that is a fraction of anything else on the market." The company emphasizes human-sounding output and real-time performance as core differentiators.
Mistral positions the open-source nature and customization flexibility as competitive advantages, allowing enterprises to tune models for specific use cases rather than relying on proprietary, managed solutions.
Strategic Context
Voxtral TTS complements Mistral's earlier 2026 releases of transcription models for batch and real-time processing. Stock indicated the company plans "an end-to-end platform that can handle multimodal streams of input, including audio, text, and image and output as well," suggesting a broader vision for agentic systems that process multiple modalities.
Pricing details were not disclosed, and the licensing terms for open-source use and commercial deployment remain unspecified.
What this means
Mistral is building a complete voice AI stack to compete with specialized speech companies and large language model providers offering voice capabilities. The open-source release strategy trades proprietary advantage for developer adoption and enterprise customization flexibility. The 90ms latency and edge-device focus suggest targeting real-time conversational agents rather than pre-rendered content, positioning against both traditional TTS vendors and API-based competitors.