Google releases Gemini 3.1 Flash Live, its highest-quality audio model for real-time voice AI

TL;DR

Google has released Gemini 3.1 Flash Live, its highest-quality audio and voice model designed for real-time dialogue. The model scores 90.8% on ComplexFuncBench Audio and 36.1% on Scale AI's Audio MultiChallenge with reasoning enabled, with improved tonal understanding and lower latency compared to previous versions.

Google has launched Gemini 3.1 Flash Live, a real-time audio and voice model designed to deliver more natural and reliable voice interactions. The model is now available to developers via the Gemini Live API in Google AI Studio, to enterprises through Gemini Enterprise for Customer Experience, and to all users via Gemini Live and Search Live.

Performance Benchmarks

On ComplexFuncBench Audio, which measures multi-step function calling under various constraints, Gemini 3.1 Flash Live scores 90.8%, outperforming the previous model. On Scale AI's Audio MultiChallenge, which tests complex instruction following under real-world audio conditions such as interruptions and hesitations, the model scores 36.1% with "thinking" mode enabled.

Google claims the model has lower latency than its predecessor, enabling faster responses in voice-first applications. The company also reports enhanced tonal understanding: the model can recognize acoustic nuances such as pitch and pace, and adjust its responses when a user sounds frustrated or confused.

Developer Features

For developers, Gemini 3.1 Flash Live enables building voice agents capable of executing complex, multi-step tasks in noisy environments. The model supports function calling with improved reliability at scale. In Gemini Live, users can maintain conversation context for twice as long as with the previous model, preserving continuity during extended brainstorming sessions.
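
To make the function-calling claim concrete, here is a minimal sketch of the client-side half of a multi-step voice agent: the model requests a function by name with arguments, and the application runs it and returns the result. This is not the Gemini Live API. The tool names, the shape of the call dictionary, and the `dispatch` helper are all hypothetical, chosen only to illustrate the pattern.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tools a voice agent might expose. Names and return values
# are illustrative assumptions, not part of any Google API.
def check_order_status(order_id: str) -> Dict[str, Any]:
    return {"order_id": order_id, "status": "shipped"}

def schedule_callback(phone: str, time: str) -> Dict[str, Any]:
    return {"phone": phone, "time": time, "scheduled": True}

# Registry mapping the function name the model asks for to local code.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "check_order_status": check_order_status,
    "schedule_callback": schedule_callback,
}

def dispatch(call: Dict[str, Any]) -> Dict[str, Any]:
    """Run one model-requested call; return a result or an error payload."""
    name = call.get("name")
    args = call.get("args", {})
    handler = TOOLS.get(name)
    if handler is None:
        return {"name": name, "error": f"unknown tool: {name}"}
    try:
        return {"name": name, "result": handler(**args)}
    except TypeError as exc:
        # Raised when arguments are missing or unexpected.
        return {"name": name, "error": str(exc)}

if __name__ == "__main__":
    # A two-step task, as an assumed sequence of model-issued calls.
    steps = [
        {"name": "check_order_status", "args": {"order_id": "A1"}},
        {"name": "schedule_callback", "args": {"phone": "555-0100", "time": "15:00"}},
    ]
    for step in steps:
        print(json.dumps(dispatch(step)))
```

Returning errors as data rather than raising lets the agent report the failure back to the model, which can then ask the user for a missing detail instead of ending the conversation.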

Companies including Verizon, LiveKit, and The Home Depot have provided positive feedback on the model's performance in production workflows, highlighting natural conversation quality.

Multilingual and Global Rollout

Gemini 3.1 Flash Live is inherently multilingual, enabling this week's global expansion of Search Live to over 200 countries and territories. Users can now conduct real-time, multimodal conversations with Google Search in their preferred language.

Safety and Watermarking

All audio generated by Gemini 3.1 Flash Live is watermarked using Google's SynthID technology. According to Google, this imperceptible watermark is embedded directly into audio output, enabling reliable detection of AI-generated content to help prevent misinformation.

What This Means

Gemini 3.1 Flash Live is Google's latest push in real-time voice AI, backed by reported scores of 90.8% on ComplexFuncBench Audio for function calling and 36.1% on Audio MultiChallenge for instruction following. The expansion of Search Live to more than 200 countries and territories positions Google to compete more aggressively in voice-first AI interfaces. SynthID watermarking addresses growing regulatory and safety concerns around detecting synthetic audio. For enterprises and developers, the improved tonal understanding and lower latency that Google reports should reduce friction in deploying voice agents for customer service and complex task automation.
