Google DeepMind releases Gemini 3.1 Flash TTS with audio tags for precise speech control across 70+ languages
Google DeepMind launched Gemini 3.1 Flash TTS, a text-to-speech model that achieved an Elo score of 1,211 on the Artificial Analysis TTS leaderboard. The model introduces audio tags that allow developers to control vocal style, pace, and delivery through natural language commands embedded in text input, with support for 70+ languages.
Gemini 3.1 Flash TTS's Elo score of 1,211 on the Artificial Analysis TTS leaderboard is based on thousands of blind human preference judgments. The model is now available in preview via the Gemini API, Google AI Studio, Vertex AI for enterprises, and Google Vids for Workspace users.
Audio tags for granular control
The defining feature of 3.1 Flash TTS is audio tags — natural language commands embedded directly into text input that control vocal style, pace, and delivery. According to Google, these tags provide "improved levels of granularity" for steering AI speech output.
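As a rough illustration of what "embedded directly into text input" could look like, the sketch below assembles a tagged line in Python. The square-bracket syntax and the tag names (`whispering`, `excited`) are assumptions made for this example; the announcement as summarized here does not specify the exact tag format, so Google's documentation should be treated as the authority. The helpers `with_tag` and `build_line` are hypothetical and do not call any Google API.

```python
from typing import List, Optional, Tuple


def with_tag(tag: str, text: str) -> str:
    """Prefix a span of text with an inline audio tag.

    The "[tag] text" format is an assumed convention for illustration,
    not a documented Gemini syntax.
    """
    return f"[{tag}] {text}"


def build_line(segments: List[Tuple[Optional[str], str]]) -> str:
    """Join (tag, text) segments into one input string.

    A segment whose tag is None is emitted as plain, untagged text.
    """
    parts = [with_tag(tag, text) if tag else text for tag, text in segments]
    return " ".join(parts)


line = build_line([
    (None, "The results are in."),
    ("whispering", "We did not expect this."),
    ("excited", "We won!"),
])
print(line)
```

Keeping tags as data until the last step makes it easy to swap in whatever tag vocabulary the model actually accepts.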
Google AI Studio offers three levels of control:
- Scene direction: Developers can define the environment and provide dialogue instructions to help characters remain "in-character" across multiple turns
- Speaker-level specificity: Unique Audio Profiles can be assigned to characters, with Director's Notes to adjust pace, tone, and accent
- Inline tags: Speakers can change expression mid-sentence, departing from the scene-level and speaker-level settings
Once configured, these parameters can be exported as Gemini API code for consistent voice reproduction across projects.
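One way to picture how the three levels fit together is as nested data that is flattened into a single text prompt. The sketch below is a hypothetical model of that idea, not the code Google AI Studio exports: the class names (`Speaker`, `Scene`), the field names, the `Scene:` and `Voice for` labels, and the `[laughing]` tag are all assumptions for illustration, and nothing here calls the Gemini API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Speaker:
    """Speaker-level settings: a voice profile plus optional director's notes."""
    name: str
    profile: str      # hypothetical free-text description of the voice
    notes: str = ""   # hypothetical pace/tone/accent guidance


@dataclass
class Scene:
    """Scene-level direction shared by every turn in the dialogue."""
    direction: str
    speakers: List[Speaker] = field(default_factory=list)
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, text)

    def to_prompt(self) -> str:
        """Render scene, speaker, and turn information as one text prompt."""
        lines = [f"Scene: {self.direction}", ""]
        for s in self.speakers:
            detail = f"{s.profile}. {s.notes}" if s.notes else s.profile
            lines.append(f"Voice for {s.name}: {detail}")
        lines.append("")
        for name, text in self.turns:
            # Inline tags, if any, travel inside the turn text itself.
            lines.append(f"{name}: {text}")
        return "\n".join(lines)


scene = Scene(
    direction="A late-night radio studio; both hosts stay calm and conversational.",
    speakers=[
        Speaker("Ana", "Warm, low-pitched voice", "Slow pace, slight smile"),
        Speaker("Ben", "Bright, energetic voice"),
    ],
    turns=[
        ("Ana", "Welcome back to the show."),
        ("Ben", "[laughing] I almost missed my cue!"),
    ],
)
prompt = scene.to_prompt()
print(prompt)
```

Storing the configuration as structured data rather than a hand-edited string is one way to get the consistent reproduction across projects that the export feature is aiming for.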
Performance and availability
Artificial Analysis positioned Gemini 3.1 Flash TTS in its "most attractive quadrant" for combining high-quality speech generation with low cost, though specific pricing was not disclosed. The model supports 70+ languages with what Google describes as "high-fidelity speech" and native multi-speaker dialogue.
All audio generated by the model includes SynthID watermarking — an imperceptible watermark embedded in the audio output designed to enable detection of AI-generated content.
What this means
The introduction of audio tags represents a shift toward programmatic control of AI speech synthesis through natural language rather than complex parameter tuning. The 1,211 Elo score suggests competitive performance against existing TTS models, though direct comparisons to specific competitors weren't provided. The emphasis on multi-speaker dialogue and scene-level control indicates Google is targeting use cases beyond simple text reading — particularly interactive applications, content creation, and localized media production. The mandatory SynthID watermarking addresses growing concerns about audio deepfakes, though the effectiveness of such watermarks against determined adversaries remains an open question in the field.
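To put the Elo figure in perspective, the sketch below applies the standard Elo expected-score formula, which converts a rating gap into the probability that one system is preferred in a head-to-head comparison. This assumes the leaderboard follows the conventional 400-point logistic scale, which is not confirmed here; only the 1,211 rating comes from the article, and the rival ratings are hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1,211 is the score reported in the article; the rival ratings are hypothetical.
for rival in (1111, 1161, 1211):
    p = elo_expected_score(1211, rival)
    print(f"vs {rival}: {p:.2f}")
```

Under these assumptions, a 100-point lead corresponds to being preferred in roughly 64% of blind comparisons and a 50-point lead to roughly 57%, so modest Elo gaps translate into fairly small differences in listener preference.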