Google releases Gemini 3.1 Flash Live, its highest-quality audio model for real-time voice AI
Google has released Gemini 3.1 Flash Live, its highest-quality audio and voice model designed for real-time dialogue. The model scores 90.8% on ComplexFuncBench Audio and 36.1% on Scale AI's Audio MultiChallenge with reasoning enabled, with improved tonal understanding and lower latency compared to previous versions.
Google has launched Gemini 3.1 Flash Live, a real-time audio and voice model designed to deliver more natural and reliable voice interactions. The model is now available to developers via the Gemini Live API in Google AI Studio, to enterprises through Gemini Enterprise for Customer Experience, and to all users via Gemini Live and Search Live.
Performance Benchmarks
On ComplexFuncBench Audio, which measures multi-step function calling under various constraints, Gemini 3.1 Flash Live scores 90.8%, ahead of the previous model. On Scale AI's Audio MultiChallenge, which tests complex instruction following under real-world audio conditions such as interruptions and hesitations, the model scores 36.1% with "thinking" mode enabled.
Google says the model has lower latency than its predecessor, which shortens response times in voice-first applications. The company also reports better tonal understanding: the model can recognize acoustic nuances such as pitch and pace, and can adjust its responses when a user sounds frustrated or confused.
Developer Features
For developers, Gemini 3.1 Flash Live enables building voice agents capable of executing complex, multi-step tasks in noisy environments. The model supports function calling with improved reliability at scale. In Gemini Live, users can maintain conversation context for twice as long as with the previous model, preserving continuity during extended brainstorming sessions.
Companies including Verizon, LiveKit, and The Home Depot have provided positive feedback on the model's performance in production workflows, highlighting natural conversation quality.
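To make "multi-step function calling" concrete, here is a minimal sketch of the application-side loop a voice agent typically needs: the model requests a tool by name, the application runs it, and a later request reuses the earlier result. This is not the Gemini Live API. The tool names (find_store, check_stock), the call format, and the returned data are all hypothetical and chosen only for illustration.

```python
from typing import Any, Callable

# Hypothetical tools a retail voice agent might expose.
# Names, arguments, and return values are illustrative assumptions.
def find_store(zip_code: str) -> dict[str, Any]:
    return {"store_id": "S-100", "zip_code": zip_code}

def check_stock(store_id: str, sku: str) -> dict[str, Any]:
    return {"store_id": store_id, "sku": sku, "in_stock": True}

# Registry mapping tool names to the functions that implement them.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "find_store": find_store,
    "check_stock": check_stock,
}

def dispatch(call: dict[str, Any]) -> dict[str, Any]:
    """Run one model-requested tool call and return a plain dict result.

    Errors are returned as data rather than raised, so they can be
    passed back to the model instead of ending the conversation.
    """
    name = call.get("name")
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return TOOLS[name](**call.get("args", {}))
    except TypeError as exc:  # missing or unexpected arguments
        return {"error": str(exc)}

def demo() -> list[dict[str, Any]]:
    # Step 1: the (simulated) model asks for the nearest store.
    first = dispatch({"name": "find_store", "args": {"zip_code": "30301"}})
    # Step 2: the follow-up call depends on the result of step 1.
    second = dispatch({
        "name": "check_stock",
        "args": {"store_id": first["store_id"], "sku": "HAMMER-16OZ"},
    })
    return [first, second]
```

Returning errors as ordinary results is one common design choice: a malformed call, which is more likely with noisy audio input, becomes something the model can recover from in the next turn.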
Multilingual and Global Rollout
Gemini 3.1 Flash Live is inherently multilingual, enabling this week's global expansion of Search Live to over 200 countries and territories. Users can now conduct real-time, multimodal conversations with Google Search in their preferred language.
Safety and Watermarking
All audio generated by Gemini 3.1 Flash Live is watermarked using Google's SynthID technology. According to Google, this imperceptible watermark is embedded directly into audio output, enabling reliable detection of AI-generated content to help prevent misinformation.
What This Means
Gemini 3.1 Flash Live is a step forward for Google's real-time voice AI, backed by reported benchmark scores for function calling and instruction following. The expansion of Search Live to more than 200 countries and territories positions Google to compete more aggressively in voice-first AI interfaces. SynthID watermarking responds to growing regulatory and safety concerns about detecting synthetic audio. For enterprises and developers, better tonal understanding and lower latency reduce friction in deploying voice agents for customer service and complex task automation.