model release

Google releases Gemini 3.1 Flash TTS with prompt-directed voice control

TL;DR

Google released Gemini 3.1 Flash TTS, a text-to-speech model that accepts detailed prompts to control voice characteristics, speaking style, accent, and delivery. The model is available through the standard Gemini API using the model ID 'gemini-3.1-flash-tts-preview'.

April 15, 2026 · 5:21 PM · 2 min read

Gemini 3.1 Flash TTS — Quick Specs

Input: $1/1M tokens

Output: $20/1M tokens



Google released Gemini 3.1 Flash TTS today, a text-to-speech model that accepts detailed prompts to control voice characteristics, speaking style, accent, and delivery.

The model is available through the standard Gemini API using the model ID gemini-3.1-flash-tts-preview. Unlike standard Gemini models, this variant outputs only audio files.
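A minimal sketch of what a request body might look like, assuming the preview model accepts the same REST `generateContent` shape that Google documented for earlier Gemini TTS previews (a `responseModalities` of `AUDIO` plus a `speechConfig`). Only the model ID comes from the announcement; the endpoint pattern, the field names, and the helper `build_tts_request` are assumptions for illustration and are not confirmed for this model.

```python
import json

MODEL_ID = "gemini-3.1-flash-tts-preview"  # model ID from the announcement

# Assumed endpoint pattern, borrowed from the public Gemini REST API.
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/{MODEL_ID}:generateContent"
)


def build_tts_request(prompt: str, voice_name: str = "Kore") -> dict:
    """Build a JSON-serializable body for a single-speaker TTS call.

    Field names are assumed to match earlier Gemini TTS previews.
    """
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],  # request audio-only output
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice_name}
                }
            },
        },
    }


# Build (but do not send) a request, then inspect it.
body = build_tts_request("Say cheerfully: Have a wonderful day!")
print(ENDPOINT)
print(json.dumps(body, indent=2))
```

The sketch stops short of sending anything. A real call would also need an API key (the Gemini REST API accepts one in the `x-goog-api-key` header), and the response format should be checked against Google's current documentation.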

Theatrical prompt engineering approach

The model's prompting system differs significantly from conventional TTS APIs. Google's documentation shows prompts structured like theatrical scripts, including audio profiles, scene descriptions, director's notes, and vocal direction.

Google's example prompt runs several hundred words to generate a few sentences of audio. It specifies:

Character profile ("Jaz R." hosting "The Morning Hype")
Physical scene setting ("glass-walled studio overlooking the moonlit London skyline")
Vocal style notes ("The 'Vocal Smile': You must hear the grin in the audio")
Specific accent direction ("Jaz is from Brixton, London")
Delivery instructions ("High-speed delivery with fluid transitions")
In-line performance cues ([excitedly], [shouting])

The approach treats the model like a voice actor receiving direction rather than a traditional TTS system receiving parameters.
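The script-style structure described above can be sketched as a small prompt builder. The class `VoiceDirection`, its field names, and the section labels are hypothetical and chosen for illustration; Google's exact layout may differ. The quoted fragments are the ones cited in this article, and the transcript text is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceDirection:
    """Hypothetical container for the parts of a script-style TTS prompt."""

    character: str
    scene: str
    accent: str
    style_notes: list[str] = field(default_factory=list)
    transcript: str = ""

    def render(self) -> str:
        """Join the parts into one prompt; section labels are illustrative."""
        all_notes = [*self.style_notes, f"Accent: {self.accent}"]
        notes = "\n".join(f"- {note}" for note in all_notes)
        return (
            f"AUDIO PROFILE: {self.character}\n\n"
            f"THE SCENE: {self.scene}\n\n"
            f"DIRECTOR'S NOTES:\n{notes}\n\n"
            f"TRANSCRIPT:\n{self.transcript}"
        )


direction = VoiceDirection(
    character='Jaz R., host of "The Morning Hype"',
    scene="glass-walled studio overlooking the moonlit London skyline",
    accent="Jaz is from Brixton, London",
    style_notes=[
        "The 'Vocal Smile': You must hear the grin in the audio",
        "High-speed delivery with fluid transitions",
    ],
    # Placeholder lines; only the bracketed cue names come from the article.
    transcript="[excitedly] Good morning, London! [shouting] Let's go!",
)
prompt = direction.render()
print(prompt)
```

Keeping each kind of direction in its own field means a single element, such as `accent`, can be swapped and the prompt re-rendered while the rest of the direction stays intact.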

Accent and style flexibility

Developer Simon Willison tested the accent control by changing the accent direction in the example prompt from "Brixton, London" to "Newcastle." The model generated audio with a Newcastle accent while keeping the specified energetic delivery style.

The model supports multi-speaker conversations. Google provides preset voice options including "Puck (Upbeat)" and "Kore (Firm)" for dialogue generation.
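As a rough illustration of a two-speaker setup, the sketch below maps speaker names to the preset voices mentioned above. It assumes the `multiSpeakerVoiceConfig` request shape documented for earlier Gemini TTS previews also applies to this model, which is not confirmed. The helper names and the speaker labels "Host" and "Guest" are hypothetical.

```python
def build_multi_speaker_config(voices: dict[str, str]) -> dict:
    """Map speaker names to preset voices in an assumed speechConfig shape."""
    return {
        "multiSpeakerVoiceConfig": {
            "speakerVoiceConfigs": [
                {
                    "speaker": speaker,
                    "voiceConfig": {
                        "prebuiltVoiceConfig": {"voiceName": voice}
                    },
                }
                for speaker, voice in voices.items()
            ]
        }
    }


def build_dialogue(lines: list[tuple[str, str]]) -> str:
    """Format (speaker, text) pairs as a 'Speaker: text' transcript."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in lines)


# Speaker labels are made up; the voice names are the presets cited above.
speech_config = build_multi_speaker_config({"Host": "Puck", "Guest": "Kore"})
transcript = build_dialogue(
    [
        ("Host", "Welcome back to the show."),
        ("Guest", "Glad to be here."),
    ]
)
print(speech_config)
print(transcript)
```

The speaker labels in the transcript need to match the `speaker` values in the config so each line can be routed to the intended voice.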

API details

Pricing information has not been disclosed. The model is currently in preview status, indicated by the "preview" suffix in the model ID.

The API returns WAV format audio files. Google's documentation includes transcript tags for controlling pacing, emotion, and other performance characteristics within the generated speech.
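A defensive sketch for saving the returned audio, assuming the payload arrives base64-encoded. Earlier Gemini TTS previews were documented as returning raw 16-bit mono PCM at 24 kHz that callers wrapped in a WAV container themselves, so the hypothetical helper `to_wav_bytes` passes through data that already has a WAV header and wraps anything else. The default sample rate, channel count, and sample width are assumptions carried over from those earlier previews.

```python
import base64
import io
import wave


def to_wav_bytes(
    b64_audio: str,
    rate: int = 24000,      # assumed sample rate
    channels: int = 1,      # assumed mono
    sample_width: int = 2,  # assumed 16-bit samples
) -> bytes:
    """Decode base64 audio and return WAV bytes, wrapping raw PCM if needed."""
    raw = base64.b64decode(b64_audio)
    # A RIFF/WAVE header means the payload is already a WAV file.
    if raw[:4] == b"RIFF" and raw[8:12] == b"WAVE":
        return raw
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(rate)
        wav_file.writeframes(raw)
    return buffer.getvalue()


# Stand-in payload: 0.1 s of silence as 16-bit mono PCM, base64-encoded.
fake_pcm = b"\x00\x00" * 2400
wav_bytes = to_wav_bytes(base64.b64encode(fake_pcm).decode("ascii"))

# Read the result back to confirm it parses as WAV.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getframerate(), check.getnframes())
```

Writing `wav_bytes` to a `.wav` file then works the same way whichever form the API returns.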

What this means

Gemini 3.1 Flash TTS represents a shift in TTS API design toward natural language control rather than parameter-based configuration. The theatrical prompting approach gives developers fine-grained control over delivery but requires significantly more prompt engineering than traditional TTS systems that use voice IDs and SSML tags.

The preview status and lack of disclosed pricing suggest Google is gathering usage data before general availability. That the model can interpret complex scene-setting and vocal direction from prompts alone suggests progress in translating descriptive language into audio characteristics.

For developers, this means choosing between the precision of traditional TTS parameters and the expressiveness of prompt-based direction. The approach may lower barriers for non-technical users who can describe desired outcomes in natural language rather than learning API-specific parameters.

Source: simonwillison.net


