Amazon Polly adds bidirectional streaming API for real-time speech synthesis in conversational AI
Amazon has released a new Bidirectional Streaming API for Amazon Polly that enables simultaneous text input and audio output over a single HTTP/2 connection. The API reduces end-to-end latency by 39% compared to traditional request-response TTS, since text can be sent word-by-word as LLMs generate tokens rather than buffered into complete sentences. The feature is available in the Java, JavaScript, .NET, C++, Go, Kotlin, PHP, Ruby, Rust, and Swift SDKs.
The Problem with Traditional TTS
Conventional text-to-speech APIs operate in request-response mode: developers must collect the complete text before making a synthesis request. For conversational AI applications powered by large language models (LLMs)—which generate text token-by-token—this creates a bottleneck. Users must wait for:
- The LLM to finish generating the complete response
- The TTS service to synthesize the entire text
- Audio to download before playback begins
Amazon Polly previously supported streaming audio output, but required complete input text upfront.
How Bidirectional Streaming Works
The new StartSpeechSynthesisStream API introduces true duplex communication:
- Send text incrementally: Stream text to Amazon Polly as it becomes available, word-by-word
- Receive audio immediately: Get synthesized audio bytes back in real-time as they're generated
- Control timing: Use flush configuration to trigger synthesis of buffered text
- Single connection: HTTP/2 enables simultaneous bidirectional flow
Key components include TextEvent (client → service), CloseStreamEvent (client → service), AudioEvent (service → client), and StreamClosedEvent (service → client).
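The duplex event flow can be sketched with a local mock. Note that the Python SDK does not yet support this API, so the code below simulates only the protocol shape with in-process queues: the event class names mirror those in the article, but the fields and the `mock_polly` service loop are illustrative assumptions, not the real SDK.

```python
import queue
import threading
from dataclasses import dataclass

# Stand-ins for the four stream events named above; fields are assumed.
@dataclass
class TextEvent:
    text: str

class CloseStreamEvent:
    pass

@dataclass
class AudioEvent:
    audio_chunk: bytes

class StreamClosedEvent:
    pass

def mock_polly(inbound: queue.Queue, outbound: queue.Queue) -> None:
    """Simulated service side: emit one AudioEvent per TextEvent received,
    then acknowledge stream closure."""
    while True:
        event = inbound.get()
        if isinstance(event, CloseStreamEvent):
            outbound.put(StreamClosedEvent())
            return
        outbound.put(AudioEvent(audio_chunk=event.text.encode()))

inbound, outbound = queue.Queue(), queue.Queue()
service = threading.Thread(target=mock_polly, args=(inbound, outbound))
service.start()

# Client side: stream words as they "arrive" from an LLM...
for word in ["Hello", "world"]:
    inbound.put(TextEvent(text=word))
inbound.put(CloseStreamEvent())

# ...while reading synthesized audio back on the same "connection".
received = []
while True:
    event = outbound.get()
    if isinstance(event, StreamClosedEvent):
        break
    received.append(event.audio_chunk)
service.join()
```

The key property the mock demonstrates is that audio events interleave with text events on one connection, rather than arriving only after all text is sent.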
Performance Improvements
Amazon benchmarked the bidirectional API against the traditional SynthesizeSpeech API under identical test conditions: 7,045 characters of prose (970 words) with the Matthew voice, the Generative engine, and MP3 output at 24 kHz.
Simulation conditions: an LLM generating tokens at roughly 30 ms per word.
| Metric | Traditional API | Bidirectional | Improvement |
|---|---|---|---|
| Total processing time | 115,226 ms | 70,071 ms | 39% faster |
| API calls | 27 | 1 | 27x reduction |
| Total audio bytes | 2,354,292 | 2,324,636 | Similar |
The traditional API buffers words until sentence boundaries are reached, then sends complete sentences as separate requests and waits for full audio responses. The bidirectional API sends each word as it arrives, allowing Amazon Polly to begin synthesis immediately.
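A toy latency model makes the difference concrete. The constants below are illustrative assumptions (only the ~30 ms-per-word token rate comes from the article; the per-word synthesis cost is invented for the sketch), but they show why overlapping synthesis with token generation wins.

```python
# Toy model of the two strategies described above. Assumptions: the LLM
# emits one word every TOKEN_MS, and synthesis costs SYNTH_MS_PER_WORD
# per word once text reaches the service (both values are illustrative).
TOKEN_MS = 30
SYNTH_MS_PER_WORD = 10

def sentence_buffered(words_per_sentence: int, sentences: int) -> int:
    """Traditional API: wait for a full sentence, send it, then wait
    for its complete audio before the next sentence's request."""
    total = 0
    for _ in range(sentences):
        total += words_per_sentence * TOKEN_MS           # accumulate the sentence
        total += words_per_sentence * SYNTH_MS_PER_WORD  # synthesize it serially
    return total

def word_streamed(words_per_sentence: int, sentences: int) -> int:
    """Bidirectional API: synthesis overlaps token generation, so the
    cost beyond generation time is roughly one word's synthesis."""
    n = words_per_sentence * sentences
    return n * TOKEN_MS + SYNTH_MS_PER_WORD

print(sentence_buffered(10, 3))  # 1200 ms: generation and synthesis serialize
print(word_streamed(10, 3))      # 910 ms: synthesis hides behind generation
```

Under these assumed numbers, streaming saves the entire synthesis time of all but the final word, which is the same overlap effect the benchmark measured at scale.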
Technical Details
The bidirectional streaming API eliminates the need for application-level text separation logic and complex audio reassembly that previously required multiple parallel API calls.
Supported SDKs include:
- AWS SDK for Java 2.x, JavaScript v3, .NET v4
- C++, Go v2, Kotlin, PHP v3, Ruby v3, Rust, Swift
Not currently supported: Python, .NET v3, AWS CLI v1/v2, and PowerShell.
Developers can use a reactive streams Publisher to send TextEvent objects containing text, and handle incoming AudioEvent objects through a visitor pattern response handler.
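The publisher/visitor shape can be mocked locally as well. The real reactive streams Publisher and visitor-pattern handler live in the supported SDKs (e.g. Java 2.x); this Python sketch only mirrors the structure, and every name in it is an assumption for illustration.

```python
# Mock of the SDKs' publisher/visitor shape: a generator plays the role of
# the outbound Publisher of text payloads, and a visitor object receives
# inbound events via typed callbacks. Names are illustrative, not SDK APIs.
class AudioVisitor:
    """Visitor-style response handler: one method per inbound event type."""
    def __init__(self) -> None:
        self.chunks: list[bytes] = []
        self.done = False

    def visit_audio_event(self, audio_chunk: bytes) -> None:
        self.chunks.append(audio_chunk)  # in a real app: feed the audio player

    def visit_stream_closed(self) -> None:
        self.done = True

def text_publisher(words):
    """Publisher analogue: yields one text payload per word as it arrives."""
    yield from words

def run_stream(publisher, visitor: AudioVisitor) -> None:
    # Mock service loop: dispatch one audio event per published payload,
    # then signal stream closure through the visitor.
    for word in publisher:
        visitor.visit_audio_event(word.encode())
    visitor.visit_stream_closed()

visitor = AudioVisitor()
run_stream(text_publisher(["stream", "me"]), visitor)
```

The visitor keeps event-handling code out of the send loop, which is why the SDKs pair a Publisher for outbound text with a handler object for inbound audio.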
What This Means
The bidirectional streaming API significantly reduces end-to-end latency for conversational AI by eliminating the architectural bottleneck of waiting for complete text before synthesis begins. The 39% latency reduction and 27x reduction in API calls represent a meaningful improvement for real-time applications like virtual assistants and interactive chatbots. The streaming model asks slightly more of client code in exchange for these gains, though developers who previously built sentence-buffering workarounds will likely welcome the native solution. Availability is limited to specific SDK languages, which may slow enterprise adoption initially.