Amazon Polly adds bidirectional streaming API for real-time speech synthesis in conversational AI
Amazon has released a new Bidirectional Streaming API for Amazon Polly that enables simultaneous text input and audio output over a single HTTP/2 connection. The API reduces end-to-end latency by 39% compared to traditional request-response TTS by allowing text to be sent word-by-word as LLMs generate tokens, rather than waiting for complete sentences. The feature is available in Java, JavaScript, .NET, C++, Go, Kotlin, PHP, Ruby, Rust, and Swift SDKs.
The Problem with Traditional TTS
Conventional text-to-speech APIs operate in request-response mode: developers must collect the complete text before making a synthesis request. For conversational AI applications powered by large language models (LLMs)—which generate text token-by-token—this creates a bottleneck. Users must wait for:
- The LLM to finish generating the complete response
- The TTS service to synthesize the entire text
- Audio to download before playback begins
Amazon Polly previously supported streaming audio output, but required complete input text upfront.
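The cost of that bottleneck can be seen with a back-of-the-envelope latency model. This is an illustrative sketch only: the 30 ms/word generation rate comes from Amazon's benchmark setup below, but the per-word synthesis cost is a hypothetical number chosen for the example, and the model assumes synthesis keeps pace with generation.

```python
# Illustrative latency model (hypothetical numbers, not Amazon's benchmark).
# Request-response TTS runs generation and synthesis back-to-back;
# a bidirectional stream overlaps them.

GEN_MS_PER_WORD = 30   # assumed LLM token rate (~30 ms/word, as in the benchmark)
TTS_MS_PER_WORD = 20   # hypothetical per-word synthesis cost (an assumption)
WORDS = 100

# Request-response: wait for the full text, then synthesize all of it.
sequential_ms = WORDS * GEN_MS_PER_WORD + WORDS * TTS_MS_PER_WORD

# Bidirectional: synthesis of each word starts as soon as it arrives.
# Because synthesis here is faster than generation, the stream finishes
# roughly one synthesis step after the last word is generated.
pipelined_ms = WORDS * GEN_MS_PER_WORD + TTS_MS_PER_WORD

print(sequential_ms, pipelined_ms)  # the pipelined total is markedly smaller
```

The exact savings depend on the relative speeds of the two stages, but as long as synthesis can keep up with generation, the streamed pipeline's total time approaches the generation time alone.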
How Bidirectional Streaming Works
The new StartSpeechSynthesisStream API introduces true duplex communication:
- Send text incrementally: Stream text to Amazon Polly as it becomes available, word-by-word
- Receive audio immediately: Get synthesized audio bytes back in real-time as they're generated
- Control timing: Use flush configuration to trigger synthesis of buffered text
- Single connection: HTTP/2 enables simultaneous bidirectional flow
Key components include TextEvent (client → service), CloseStreamEvent (client → service), AudioEvent (service → client), and StreamClosedEvent (service → client).
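The event exchange above can be modeled as a simple duplex loop. The sketch below is a toy simulation, not SDK code: the class names mirror the event names from the API, but the service side is faked, and real SDKs carry additional fields and configuration not shown here.

```python
from dataclasses import dataclass

# Client -> service events
@dataclass
class TextEvent:
    text: str

class CloseStreamEvent:
    pass

# Service -> client events
@dataclass
class AudioEvent:
    audio_chunk: bytes

class StreamClosedEvent:
    pass

def fake_polly_stream(inbound):
    """Toy stand-in for the service side of StartSpeechSynthesisStream:
    emits an AudioEvent per TextEvent and a StreamClosedEvent at the end."""
    for event in inbound:
        if isinstance(event, TextEvent):
            # Pretend synthesis: echo the text bytes back as "audio"
            yield AudioEvent(audio_chunk=event.text.encode())
        elif isinstance(event, CloseStreamEvent):
            yield StreamClosedEvent()
            return

inbound = [TextEvent("Hello "), TextEvent("world."), CloseStreamEvent()]
outbound = list(fake_polly_stream(inbound))
```

In the real API the two directions flow concurrently over one HTTP/2 connection, so audio events arrive while later text events are still being sent.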
Performance Improvements
Amazon benchmarked the bidirectional API against the traditional SynthesizeSpeech API using identical test conditions: 7,045 characters of prose (970 words) with the Matthew voice, Generative engine, MP3 output at 24kHz.
Simulation conditions: LLM generating tokens at ~30ms per word.
| Metric | Traditional API | Bidirectional | Improvement |
|---|---|---|---|
| Total processing time | 115,226 ms | 70,071 ms | 39% faster |
| API calls | 27 | 1 | 27x reduction |
| Total audio bytes | 2,354,292 | 2,324,636 | Similar |
The traditional API buffers words until sentence boundaries are reached, then sends complete sentences as separate requests and waits for full audio responses. The bidirectional API sends each word as it arrives, allowing Amazon Polly to begin synthesis immediately.
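The headline figures in the table follow directly from the raw benchmark numbers:

```python
# Reproduce the table's derived figures from its raw numbers.
traditional_ms = 115_226
bidirectional_ms = 70_071

# (115,226 - 70,071) / 115,226 ~= 39.2%, matching the reported "39% faster"
speedup_pct = (traditional_ms - bidirectional_ms) / traditional_ms * 100
print(round(speedup_pct, 1))

# 27 sentence-level requests collapse into a single streaming call
calls_traditional, calls_bidirectional = 27, 1
print(calls_traditional // calls_bidirectional)
```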
Technical Details
The bidirectional streaming API eliminates the application-level sentence-splitting logic and the complex audio reassembly that multiple parallel API calls previously made necessary.
Supported SDKs include:
- AWS SDK for Java 2.x, JavaScript v3, .NET v4
- C++, Go v2, Kotlin, PHP v3, Ruby v3, Rust, Swift
Not currently supported: Python, .NET v3, AWS CLI v1/v2, and PowerShell.
Developers can use a reactive streams Publisher to send TextEvent objects containing text, and handle incoming AudioEvent objects through a visitor pattern response handler.
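That visitor-style dispatch can be sketched as follows. This is an illustrative Python model of the pattern, not SDK code: the handler method names are assumptions, and each SDK defines its own response-handler interface.

```python
from dataclasses import dataclass

@dataclass
class AudioEvent:
    audio_chunk: bytes

class StreamClosedEvent:
    pass

class ResponseHandler:
    """Visitor-style response handler: one callback per incoming event
    type, mirroring the pattern the SDKs use (names are illustrative)."""

    def __init__(self):
        self.audio = bytearray()
        self.closed = False

    def visit(self, event):
        # Dispatch on the concrete event type
        if isinstance(event, AudioEvent):
            self.on_audio(event)
        elif isinstance(event, StreamClosedEvent):
            self.on_stream_closed(event)

    def on_audio(self, event):
        # Append synthesized bytes; a real handler would feed a player
        self.audio.extend(event.audio_chunk)

    def on_stream_closed(self, event):
        self.closed = True

handler = ResponseHandler()
for ev in [AudioEvent(b"\x01\x02"), AudioEvent(b"\x03"), StreamClosedEvent()]:
    handler.visit(ev)
```

The visitor keeps per-event-type logic separate, so playback, metrics, and shutdown handling stay decoupled as new event types are added.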
What This Means
The bidirectional streaming API significantly reduces end-to-end latency for conversational AI by eliminating the architectural bottleneck of waiting for complete text before synthesis begins. The 39% latency reduction and 27x drop in API calls represent a meaningful improvement for real-time applications such as virtual assistants and interactive chatbots. Developers who previously relied on sentence-buffering workarounds gain a native solution, though at the cost of adopting a more involved streaming programming model. Availability is limited to specific SDK languages, which may slow enterprise adoption initially.