Google releases Gemini 3.1 Flash TTS with prompt-directed voice control

TL;DR

Google released Gemini 3.1 Flash TTS, a text-to-speech model that accepts detailed prompts to control voice characteristics, speaking style, accent, and delivery. The model is available through the standard Gemini API using the model ID 'gemini-3.1-flash-tts-preview'.

Google released Gemini 3.1 Flash TTS today, a text-to-speech model that accepts detailed prompts to control voice characteristics, speaking style, accent, and delivery.

The model is available through the standard Gemini API using the model ID gemini-3.1-flash-tts-preview. Unlike other Gemini models, this variant returns only audio, not text.

Theatrical prompt engineering approach

The model's prompting system differs significantly from conventional TTS APIs. Google's documentation shows prompts structured like theatrical scripts, including audio profiles, scene descriptions, director's notes, and vocal direction.

Google's example prompt runs several hundred words to generate a few sentences of audio. It specifies:

  • Character profile ("Jaz R." hosting "The Morning Hype")
  • Physical scene setting ("glass-walled studio overlooking the moonlit London skyline")
  • Vocal style notes ("The 'Vocal Smile': You must hear the grin in the audio")
  • Specific accent direction ("Jaz is from Brixton, London")
  • Delivery instructions ("High-speed delivery with fluid transitions")
  • In-line performance cues ([excitedly], [shouting])

The approach treats the model like a voice actor receiving direction rather than a traditional TTS system receiving parameters.
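To make the structure concrete, here is a minimal sketch of how such a script-style prompt could be assembled in Python. The VoiceDirection class, its field names, and the section labels are hypothetical and chosen only for illustration; Google's documentation may use different headings. The quoted direction comes from the example described above, while the transcript line is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceDirection:
    """Hypothetical container for the parts of a script-style TTS prompt."""

    audio_profile: str
    scene: str
    directors_notes: list[str] = field(default_factory=list)
    transcript: str = ""

    def render(self) -> str:
        # Section labels are illustrative, not Google's required wording.
        notes = "\n".join(f"- {note}" for note in self.directors_notes)
        return (
            f"AUDIO PROFILE\n{self.audio_profile}\n\n"
            f"SCENE\n{self.scene}\n\n"
            f"DIRECTOR'S NOTES\n{notes}\n\n"
            f"TRANSCRIPT\n{self.transcript}"
        )


direction = VoiceDirection(
    audio_profile='Jaz R., host of "The Morning Hype". Jaz is from Brixton, London.',
    scene="A glass-walled studio overlooking the moonlit London skyline.",
    directors_notes=[
        "The 'Vocal Smile': You must hear the grin in the audio.",
        "High-speed delivery with fluid transitions.",
    ],
    # Placeholder transcript; the in-line cues follow the bracket style above.
    transcript="[excitedly] Good morning, London! [shouting] Let's go!",
)
prompt = direction.render()
print(prompt)
```

Keeping the direction in separate fields makes it easy to change one element, such as the accent in the audio profile, without touching the rest of the prompt.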

Accent and style flexibility

Developer Simon Willison tested the accent control by modifying the example prompt from "Brixton, London" to "Newcastle." The model generated audio with a Newcastle accent while maintaining the specified energetic delivery style.

The model supports multi-speaker conversations. Google provides preset voice options including "Puck (Upbeat)" and "Kore (Firm)" for dialogue generation.
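As a rough illustration, a multi-speaker request body might be built like this. The field names (responseModalities, speechConfig, multiSpeakerVoiceConfig, and so on) are an assumption based on the request shape documented for earlier Gemini TTS previews and are not confirmed for this model; the build_multi_speaker_request helper, the speaker labels, and the dialogue are hypothetical. The sketch only constructs the JSON and sends nothing.

```python
import json


def build_multi_speaker_request(script: str, voices: dict[str, str]) -> dict:
    """Build a JSON-serializable request body for a multi-voice dialogue.

    ASSUMPTION: field names follow the speech-config shape documented for
    earlier Gemini TTS previews; they may differ for this model.
    """
    speaker_configs = [
        {
            "speaker": speaker,
            "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}},
        }
        for speaker, voice in voices.items()
    ]
    return {
        "contents": [{"parts": [{"text": script}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {
                    "speakerVoiceConfigs": speaker_configs
                }
            },
        },
    }


# Placeholder dialogue; speaker labels must match the keys in the voice map.
script = (
    "Host: [excitedly] Welcome back to the show!\n"
    "Guest: Thanks for having me."
)
body = build_multi_speaker_request(script, {"Host": "Puck", "Guest": "Kore"})
print(json.dumps(body, indent=2))
```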

API details

Google has not disclosed pricing. The model is in preview, as the "preview" suffix in the model ID indicates.

The API returns WAV format audio files. Google's documentation includes transcript tags for controlling pacing, emotion, and other performance characteristics within the generated speech.
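When saving the output, a defensive approach is to check what the bytes actually contain. The sketch below passes through data that already carries a WAV header and otherwise wraps it as raw PCM using Python's standard wave module. The to_wav_bytes helper is hypothetical, and the 24 kHz, 16-bit mono defaults are assumptions rather than documented values for this model.

```python
import io
import wave


def to_wav_bytes(audio: bytes, sample_rate: int = 24000,
                 channels: int = 1, sample_width: int = 2) -> bytes:
    """Return WAV bytes, wrapping raw PCM in a WAV container if needed.

    ASSUMPTION: the PCM defaults (24 kHz, mono, 16-bit) are illustrative.
    """
    # A WAV file starts with "RIFF" and carries "WAVE" at byte offset 8.
    if audio[:4] == b"RIFF" and audio[8:12] == b"WAVE":
        return audio
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(audio)
    return buffer.getvalue()


# Stand-in for decoded audio bytes from a response: 0.1 s of silence.
fake_pcm = b"\x00\x00" * 2400
wav_bytes = to_wav_bytes(fake_pcm)

# Writing the result to disk is then a plain binary write, for example:
# with open("speech.wav", "wb") as f:
#     f.write(wav_bytes)
```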

What this means

Gemini 3.1 Flash TTS represents a shift in TTS API design toward natural language control rather than parameter-based configuration. The theatrical prompting approach gives developers fine-grained control over delivery but requires significantly more prompt engineering than traditional TTS systems that use voice IDs and SSML tags.

The preview status and undisclosed pricing suggest Google is gathering usage data before general availability. The model's ability to follow complex scene-setting and vocal direction from a prompt alone suggests that models are getting better at translating descriptive language into audio characteristics.

For developers, this means choosing between the precision of traditional TTS parameters and the expressiveness of prompt-based direction. The approach may lower barriers for non-technical users who can describe desired outcomes in natural language rather than learning API-specific parameters.
