Google DeepMind releases Gemini 3.1 Flash TTS with audio tags for precise speech control across 70+ languages
Google DeepMind launched Gemini 3.1 Flash TTS, a text-to-speech model that achieved an Elo score of 1,211 on the Artificial Analysis TTS leaderboard, based on thousands of blind human preference comparisons. The model is now available in preview via the Gemini API, Google AI Studio, Vertex AI for enterprises, and Google Vids for Workspace users.
Audio tags for granular control
The defining feature of Gemini 3.1 Flash TTS is audio tags: natural-language commands embedded directly in the text input that control vocal style, pace, and delivery. According to Google, these tags provide "improved levels of granularity" for steering AI speech output.
Google AI Studio offers three levels of control:
- Scene direction: Developers can define the environment and provide dialogue instructions to help characters remain "in-character" across multiple turns
- Speaker-level specificity: Unique Audio Profiles can be assigned to characters, with Director's Notes to adjust pace, tone, and accent
- Inline tags: Speakers can change expression mid-sentence, overriding the higher-level settings at specific points in the script
Once configured, these parameters can be exported as Gemini API code for consistent voice reproduction across projects.
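Google has not published the exported code format for the preview, but a minimal sketch of what such a call might look like, assuming the new model follows the request shape of existing Gemini TTS models in the google-genai Python SDK, is shown below. The model ID comes from the Gemini API preview; the inline "[whispering]" tag syntax and the "Kore" voice name are illustrative assumptions, not documented behavior of this model.

```python
# Minimal sketch: single-speaker TTS with an inline audio tag.
# Assumes the preview follows the request shape of existing Gemini TTS
# models; the "[whispering]" tag and "Kore" voice are illustrative.
import wave

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",  # preview model ID from the Gemini API
    contents=(
        "Narrator speaks warmly, then drops to a hush: "
        "Welcome back. [whispering] But keep this next part between us."
    ),
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)

# Existing Gemini TTS models return raw 16-bit PCM at 24 kHz, which can
# be wrapped in a WAV container with the standard-library wave module.
pcm_bytes = response.candidates[0].content.parts[0].inline_data.data
with wave.open("out.wav", "wb") as f:
    f.setnchannels(1)   # mono
    f.setsampwidth(2)   # 16-bit samples
    f.setframerate(24000)
    f.writeframes(pcm_bytes)
```

If the exported code from AI Studio differs, the key idea should carry over: the style direction lives in the text itself rather than in request parameters, so the same configuration can be replayed verbatim across projects.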
Performance and availability
Artificial Analysis positioned Gemini 3.1 Flash TTS in its "most attractive quadrant" for combining high-quality speech generation with low cost, though specific pricing was not disclosed. The model supports 70+ languages with what Google describes as "high-fidelity speech" and native multi-speaker dialogue.
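Google's existing TTS models express multi-speaker dialogue as per-speaker voice configurations attached to a single request. A hedged sketch of that pattern applied to the new model follows; whether the preview keeps the MultiSpeakerVoiceConfig shape is an assumption, and the speaker names, voices, and "[excited]" tag are illustrative.

```python
# Sketch of a two-speaker dialogue request, assuming the preview keeps
# the MultiSpeakerVoiceConfig shape used by existing Gemini TTS models.
# Speaker names, voice names, and the inline tag are illustrative.
from google import genai
from google.genai import types

client = genai.Client()

script = (
    "Host: Welcome to the show. Today we're talking about audio tags.\n"
    "Guest: Thanks for having me. [excited] I've been waiting to discuss this."
)

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",  # preview model ID
    contents=script,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Host",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name="Kore"
                            )
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Guest",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name="Puck"
                            )
                        ),
                    ),
                ]
            )
        ),
    ),
)
```

The design point is that speaker identity is resolved from the labels in the script, so a whole scene renders in one request rather than being stitched together from separate single-voice calls.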
All audio generated by the model includes SynthID watermarking — an imperceptible watermark embedded in the audio output designed to enable detection of AI-generated content.
What this means
The introduction of audio tags represents a shift toward programmatic control of AI speech synthesis through natural language rather than complex parameter tuning. The 1,211 Elo score suggests competitive performance against existing TTS models, though direct comparisons to specific competitors weren't provided. The emphasis on multi-speaker dialogue and scene-level control indicates Google is targeting use cases beyond simple text reading — particularly interactive applications, content creation, and localized media production. The mandatory SynthID watermarking addresses growing concerns about audio deepfakes, though the effectiveness of such watermarks against determined adversaries remains an open question in the field.
Related Articles
Google releases Gemini 3.1 Flash TTS with prompt-directed voice control
Google released Gemini 3.1 Flash TTS, a text-to-speech model that accepts detailed prompts to control voice characteristics, speaking style, accent, and delivery. The model is available through the standard Gemini API using the model ID 'gemini-3.1-flash-tts-preview'.
Meta releases Llama Guard 4, a 12B parameter multimodal safety classifier with 164K context window
Meta has released Llama Guard 4, a 12-billion parameter content safety classifier derived from Llama 4 Scout. The model features a 163,840 token context window and can classify both text and image content, available free through OpenRouter with an August 31, 2024 knowledge cutoff.
Google releases Gemma 4, open-source on-device AI with agentic tool use for phones
Google released Gemma 4, an open-source multimodal model that runs entirely on smartphones without sending data to the cloud. The E2B and E4B variants require just 6GB and 8GB of RAM respectively and can autonomously use tools like Wikipedia, maps, and QR code generators through built-in agent skills. The model is available free via the Google AI Edge Gallery app for Android and iOS.
Liquid AI releases LFM2.5-VL-450M, improved 450M-parameter vision-language model with multilingual support
Liquid AI has released LFM2.5-VL-450M, a refreshed 450M-parameter vision-language model built on an updated LFM2.5-350M backbone. The model features a 32,768-token context window, supports 9 languages, handles native 512×512 pixel images, and adds bounding box prediction and function calling capabilities. Performance improvements span both vision and language benchmarks compared to its predecessor.