Google releases Gemini 3.1 Flash TTS with prompt-directed voice control
Google released Gemini 3.1 Flash TTS today, a text-to-speech model that accepts detailed prompts to control voice characteristics, speaking style, accent, and delivery.
The model is available through the standard Gemini API using the model ID gemini-3.1-flash-tts-preview. Unlike standard Gemini models, this variant outputs only audio files.
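A request might look like the following sketch, which builds the generateContent payload for the standard Gemini API. The model ID comes from the announcement; the audio-specific field names (responseModalities, speechConfig, prebuiltVoiceConfig) follow Google's earlier TTS preview documentation and are assumptions for this release.

```python
# Hypothetical sketch of a generateContent request body for the new
# TTS model. Field names are assumptions based on earlier Gemini TTS
# previews and may differ in the final documentation.

MODEL_ID = "gemini-3.1-flash-tts-preview"

def build_tts_request(prompt: str, voice: str = "Puck") -> dict:
    """Assemble a generateContent payload that asks for audio output."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            # Unlike standard Gemini models, this variant outputs only audio.
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice}
                }
            },
        },
    }

request = build_tts_request("[excitedly] Good morning, London!")
```

The prompt string carries all of the voice direction; there are no separate pitch or rate parameters in this sketch.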
Theatrical prompt engineering approach
The model's prompting system differs significantly from conventional TTS APIs. Google's documentation shows prompts structured like theatrical scripts, including audio profiles, scene descriptions, director's notes, and vocal direction.
Google's example prompt runs several hundred words to generate a few sentences of audio. It specifies:
- Character profile ("Jaz R." hosting "The Morning Hype")
- Physical scene setting ("glass-walled studio overlooking the moonlit London skyline")
- Vocal style notes ("The 'Vocal Smile': You must hear the grin in the audio")
- Specific accent direction ("Jaz is from Brixton, London")
- Delivery instructions ("High-speed delivery with fluid transitions")
- In-line performance cues ("[excitedly]", "[shouting]")
The approach treats the model like a voice actor receiving direction rather than a traditional TTS system receiving parameters.
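The sections listed above can be assembled programmatically. The helper below is illustrative only: the section labels are not an official schema, just one way to organize a director-style prompt like Google's example.

```python
def build_director_prompt(profile: str, scene: str, notes: str,
                          accent: str, delivery: str, script: str) -> str:
    """Assemble a theatrical-style TTS prompt from its sections.

    The section labels are illustrative, not an official schema;
    the model reads the whole thing as free-form direction.
    """
    sections = [
        f"Audio profile: {profile}",
        f"Scene: {scene}",
        f"Director's notes: {notes}",
        f"Accent: {accent}",
        f"Delivery: {delivery}",
        "Script:",
        script,
    ]
    return "\n\n".join(sections)

# Condensed version of Google's example prompt.
prompt = build_director_prompt(
    profile="Jaz R., host of 'The Morning Hype'",
    scene="Glass-walled studio overlooking the moonlit London skyline",
    notes="The 'Vocal Smile': you must hear the grin in the audio",
    accent="Jaz is from Brixton, London",
    delivery="High-speed delivery with fluid transitions",
    script="[excitedly] Good morning, London! You're live with Jaz.",
)
```

Swapping the accent line (say, to "Newcastle") is all Willison's test required; the rest of the direction stays intact.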
Accent and style flexibility
Developer Simon Willison tested the accent control by modifying the example prompt from "Brixton, London" to "Newcastle." The model generated audio with a Newcastle accent while maintaining the specified energetic delivery style.
The model supports multi-speaker conversations. Google provides preset voice options including "Puck (Upbeat)" and "Kore (Firm)" for dialogue generation.
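For dialogue, speaker names in the transcript are mapped to preset voices. The config shape below follows Google's earlier multi-speaker TTS preview documentation (multiSpeakerVoiceConfig, speakerVoiceConfigs) and is an assumption for this model:

```python
def multi_speaker_config(speakers: dict) -> dict:
    """Map transcript speaker names to preset voices.

    Field names follow Google's earlier TTS preview docs; treat them
    as an assumption for gemini-3.1-flash-tts-preview.
    """
    return {
        "multiSpeakerVoiceConfig": {
            "speakerVoiceConfigs": [
                {
                    "speaker": name,
                    "voiceConfig": {
                        "prebuiltVoiceConfig": {"voiceName": voice}
                    },
                }
                for name, voice in speakers.items()
            ]
        }
    }

# Two of the presets mentioned in the docs: Puck (Upbeat), Kore (Firm).
config = multi_speaker_config({"Host": "Puck", "Guest": "Kore"})
```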
API details
Pricing information has not been disclosed. The model is currently in preview status, indicated by the "preview" suffix in the model ID.
The API returns WAV format audio files. Google's documentation includes transcript tags for controlling pacing, emotion, and other performance characteristics within the generated speech.
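While the documentation describes WAV output, Google's earlier TTS previews returned raw 16-bit PCM that the client wrapped in a WAV container itself. A small stdlib helper covers that case; the 24 kHz mono format is an assumption carried over from those previews, so check the response metadata for the actual values.

```python
import io
import wave

def wrap_pcm_as_wav(pcm: bytes, rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container.

    24 kHz mono is an assumption based on Google's earlier TTS
    previews; verify against the response metadata.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)    # mono
        wav.setsampwidth(2)    # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()

# 10 ms of silence: 240 frames at 24 kHz, 2 bytes per frame.
wav_bytes = wrap_pcm_as_wav(b"\x00\x00" * 240)
```

If the API already returns a complete WAV file, the bytes can be written to disk as-is and this step is unnecessary.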
What this means
Gemini 3.1 Flash TTS represents a shift in TTS API design toward natural language control rather than parameter-based configuration. The theatrical prompting approach gives developers fine-grained control over delivery but requires significantly more prompt engineering than traditional TTS systems that use voice IDs and SSML tags.
The preview status and lack of disclosed pricing suggest Google is gathering usage data before general availability. The model's ability to interpret complex scene-setting and vocal direction through prompts alone indicates real progress in translating descriptive language into audio characteristics.
For developers, this means choosing between the precision of traditional TTS parameters and the expressiveness of prompt-based direction. The approach may lower barriers for non-technical users who can describe desired outcomes in natural language rather than learning API-specific parameters.