Google releases Gemini 3.1 Flash TTS with prompt-directed voice control

TL;DR

Google released Gemini 3.1 Flash TTS, a text-to-speech model that accepts detailed prompts to control voice characteristics, speaking style, accent, and delivery. The model is available through the standard Gemini API using the model ID 'gemini-3.1-flash-tts-preview'.

Google released Gemini 3.1 Flash TTS today, a text-to-speech model that accepts detailed prompts to control voice characteristics, speaking style, accent, and delivery.

The model is available through the standard Gemini API using the model ID gemini-3.1-flash-tts-preview. Unlike standard Gemini models, this variant outputs only audio files.

Theatrical prompt engineering approach

The model's prompting system differs significantly from conventional TTS APIs. Google's documentation shows prompts structured like theatrical scripts, including audio profiles, scene descriptions, director's notes, and vocal direction.

Google's example prompt runs several hundred words to generate a few sentences of audio. It specifies:

  • Character profile ("Jaz R." hosting "The Morning Hype")
  • Physical scene setting ("glass-walled studio overlooking the moonlit London skyline")
  • Vocal style notes ("The 'Vocal Smile': You must hear the grin in the audio")
  • Specific accent direction ("Jaz is from Brixton, London")
  • Delivery instructions ("High-speed delivery with fluid transitions")
  • In-line performance cues ([excitedly], [shouting])

The approach treats the model like a voice actor receiving direction rather than a traditional TTS system receiving parameters.
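One way to keep prompts of this length manageable is to assemble them from structured fields. The sketch below is illustrative only: the `VoiceDirection` class, the `build_prompt` helper, and the section labels (AUDIO PROFILE, SCENE, DIRECTOR'S NOTES, TRANSCRIPT) are hypothetical names chosen here, not Google's required format, and the transcript line is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceDirection:
    """Hypothetical container for the kinds of direction the example prompt uses."""
    character: str
    scene: str
    style_notes: list[str] = field(default_factory=list)
    accent: str = ""
    delivery: str = ""


def build_prompt(direction: VoiceDirection, transcript: str) -> str:
    """Assemble a script-like prompt. Section labels are illustrative assumptions."""
    parts = [
        f"AUDIO PROFILE: {direction.character}",
        f"SCENE: {direction.scene}",
    ]
    # Collect style, accent, and delivery notes into one director's-notes section.
    notes = list(direction.style_notes)
    if direction.accent:
        notes.append(f"Accent: {direction.accent}")
    if direction.delivery:
        notes.append(f"Delivery: {direction.delivery}")
    if notes:
        parts.append("DIRECTOR'S NOTES:\n" + "\n".join(f"- {n}" for n in notes))
    # The transcript carries the spoken text plus inline cues such as [excitedly].
    parts.append("TRANSCRIPT:\n" + transcript)
    return "\n\n".join(parts)


direction = VoiceDirection(
    character='Jaz R., host of "The Morning Hype"',
    scene="glass-walled studio overlooking the moonlit London skyline",
    style_notes=["The 'Vocal Smile': You must hear the grin in the audio"],
    accent="Jaz is from Brixton, London",
    delivery="High-speed delivery with fluid transitions",
)
# Placeholder transcript line, not taken from Google's example.
prompt = build_prompt(direction, "[excitedly] Good morning, London!")
```

Keeping each piece of direction in its own field means a single detail, such as the accent, can be changed without touching the rest of the prompt.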

Accent and style flexibility

Developer Simon Willison tested the accent control by modifying the example prompt from "Brixton, London" to "Newcastle." The model generated audio with a Newcastle accent while maintaining the specified energetic delivery style.

The model supports multi-speaker conversations. Google provides preset voice options including "Puck (Upbeat)" and "Kore (Firm)" for dialogue generation.
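A multi-speaker request might be assembled roughly as follows. This is a sketch under a stated assumption: that the 3.1 preview accepts the same JSON request shape (`speechConfig.multiSpeakerVoiceConfig.speakerVoiceConfigs`) that Google documented for its earlier Gemini TTS previews, which this article does not confirm. The `multi_speaker_request` helper, the speaker names "Host" and "Guest", and the dialogue lines are hypothetical. The code only builds the payload and does not call the API.

```python
import json


def multi_speaker_request(transcript: str, voices: dict[str, str]) -> dict:
    """Build a generateContent-style body mapping speaker names to preset voices.

    Field names are assumed to match earlier Gemini TTS previews.
    """
    return {
        "contents": [{"parts": [{"text": transcript}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {
                    "speakerVoiceConfigs": [
                        {
                            "speaker": speaker,
                            "voiceConfig": {
                                "prebuiltVoiceConfig": {"voiceName": voice}
                            },
                        }
                        for speaker, voice in voices.items()
                    ]
                }
            },
        },
    }


# Speaker labels in the transcript must match the keys of the voice mapping.
transcript = "Host: [excitedly] Welcome back!\nGuest: Glad to be here."
body = multi_speaker_request(transcript, {"Host": "Puck", "Guest": "Kore"})
payload = json.dumps(body, indent=2)
```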

API details

Google has not disclosed pricing. The model is in preview, as the "preview" suffix in its model ID indicates.

The API returns WAV format audio files. Google's documentation also describes transcript tags, such as the bracketed [excitedly] cue, for controlling pacing, emotion, and other performance characteristics within the generated speech.
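If a client receives raw PCM samples rather than a finished file, the audio can be wrapped in a WAV container with Python's standard library. The sketch below assumes 16-bit mono audio at 24 kHz, which is how earlier Gemini TTS previews delivered audio; the article does not confirm those values for this model, so check them against the actual response. The `pcm_to_wav` helper is a hypothetical name, and one second of silence stands in for real model output.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, rate: int = 24000, channels: int = 1,
               sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container.

    The defaults (24 kHz, mono, 16-bit) are assumptions, not confirmed values.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # bytes per sample
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buf.getvalue()


# One second of 16-bit silence as a stand-in for audio bytes from the API.
silence = b"\x00\x00" * 24000
wav_bytes = pcm_to_wav(silence)
```

Writing `wav_bytes` to a file opened in binary mode produces a file that standard audio players can open.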

What this means

Gemini 3.1 Flash TTS moves TTS API design toward natural-language control and away from parameter-based configuration. The theatrical prompting approach gives developers fine-grained control over delivery, but it demands far more prompt engineering than traditional TTS systems built on voice IDs and SSML tags.

The preview status and lack of disclosed pricing suggest Google is gathering usage data before general availability. The model's ability to interpret complex scene-setting and vocal direction from prompts alone points to progress in translating descriptive language into audio characteristics.

For developers, this means choosing between the precision of traditional TTS parameters and the expressiveness of prompt-based direction. The approach may lower barriers for non-technical users who can describe desired outcomes in natural language rather than learning API-specific parameters.
