text-to-speech

10 articles tagged with text-to-speech

May 10, 2026
model release

Supertone releases Supertonic 3: 99M-parameter on-device TTS model supporting 31 languages

Supertone has released Supertonic 3, a 99M-parameter text-to-speech model that runs entirely on-device using ONNX Runtime. The model expands language support from 5 to 31 languages compared to Supertonic 2, requires no GPU, and claims competitive accuracy against models 7-20x larger.

April 15, 2026
model release

Google releases Gemini 3.1 Flash TTS with prompt-directed voice control

Google released Gemini 3.1 Flash TTS, a text-to-speech model that accepts detailed prompts to control voice characteristics, speaking style, accent, and delivery. The model is available through the standard Gemini API using the model ID 'gemini-3.1-flash-tts-preview'.

model releaseGoogle DeepMind

Google DeepMind releases Gemini 3.1 Flash TTS with audio tags for precise speech control across 70+ languages

Google DeepMind launched Gemini 3.1 Flash TTS, a text-to-speech model that achieved an Elo score of 1,211 on the Artificial Analysis TTS leaderboard. The model introduces audio tags that allow developers to control vocal style, pace, and delivery through natural language commands embedded in text input, with support for 70+ languages.

March 30, 2026
product update

Gemini Live voice quality deteriorates after 3.1 Flash update, voices sound nothing like preview

Google's Gemini Live is experiencing persistent voice quality issues following the recent Gemini 3.1 Flash Live update. Users report that voice options like "Capella" (British female accent) have deteriorated significantly, with speech patterns changing dramatically during conversations and audio artifacts like crackles and pops becoming prominent.

March 26, 2026
model release

Mistral releases Voxtral, open-weight TTS model that clones voices from 3 seconds of audio

Mistral has released Voxtral TTS, a 4-billion-parameter text-to-speech model that can clone voices from just three seconds of reference audio across nine languages. The model delivers 70ms latency for typical 10-second samples and outperformed ElevenLabs Flash v2.5 in naturalness tests. Voxtral is available via API at $0.016 per 1,000 characters and as open-weights on Hugging Face.

product updateAmazon Web Services

Amazon Polly adds bidirectional streaming API for real-time speech synthesis in conversational AI

Amazon has released a new Bidirectional Streaming API for Amazon Polly that enables simultaneous text input and audio output over a single HTTP/2 connection. The API reduces end-to-end latency by 39% compared to traditional request-response TTS by allowing text to be sent word-by-word as LLMs generate tokens, rather than waiting for complete sentences. The feature is available in Java, JavaScript, .NET, C++, Go, Kotlin, PHP, Ruby, Rust, and Swift SDKs.

model releaseMistral AI

Mistral releases Voxtral-4B-TTS-2603, open-weights text-to-speech model for production voice agents

Mistral AI released Voxtral-4B-TTS-2603, an open-weights text-to-speech model designed for production voice agents. The 4B-parameter model supports 9 languages, 20 preset voices, achieves 70ms latency at concurrency 1 on a single NVIDIA H200, and requires only 16GB GPU memory.

model release

Mistral releases Voxtral TTS, open-source speech model for enterprise voice agents

Mistral AI released Voxtral TTS, an open-source text-to-speech model designed for enterprise voice agents and edge devices. The model supports nine languages, adapts custom voices from samples under five seconds, and achieves 90ms time-to-first-audio latency with a 6x real-time factor.

March 23, 2026
model releaseXiaomi

Xiaomi launches MiMo-V2-Pro with 1T parameters, matches Claude Opus on coding at 80% lower cost

Xiaomi shipped three AI models simultaneously designed to form a complete agent platform. MiMo-V2-Pro, a 1-trillion-parameter Mixture-of-Experts model with 42 billion active parameters per request, scores 78% on SWE-bench Verified and 81 points on ClawEval—nearly matching Claude Opus 4.6 while costing $1 per million input tokens versus $5 for Opus.

March 11, 2026
model release

Hume AI releases TADA-1B, a 1 billion parameter text-to-speech model

Hume AI has released TADA-1B, a 1 billion parameter text-to-speech model available on Hugging Face under an MIT license. The model, which combines speech and language capabilities, has already accumulated over 3,100 downloads since its January 12 release.