Amazon Polly adds bidirectional streaming API for real-time speech synthesis in conversational AI
Amazon has released a new Bidirectional Streaming API for Amazon Polly that enables simultaneous text input and audio output over a single HTTP/2 connection. In Amazon's benchmark, the API cut total processing time by 39% compared with the traditional request-response SynthesizeSpeech API by allowing text to be sent word by word as an LLM generates tokens, rather than waiting for complete sentences. The feature is available in the AWS SDKs for Java, JavaScript, .NET, C++, Go, Kotlin, PHP, Ruby, Rust, and Swift.
The Problem with Traditional TTS
Conventional text-to-speech APIs operate in request-response mode: developers must collect the complete text before making a synthesis request. For conversational AI applications powered by large language models (LLMs)—which generate text token-by-token—this creates a bottleneck. Users must wait for:
- The LLM to finish generating the complete response
- The TTS service to synthesize the entire text
- Audio to download before playback begins
Amazon Polly previously supported streaming audio output, but required complete input text upfront.
How Bidirectional Streaming Works
The new StartSpeechSynthesisStream API introduces true duplex communication:
- Send text incrementally: Stream text to Amazon Polly as it becomes available, word-by-word
- Receive audio immediately: Get synthesized audio bytes back in real-time as they're generated
- Control timing: Use flush configuration to trigger synthesis of buffered text
- Single connection: HTTP/2 enables simultaneous bidirectional flow
Key components include TextEvent (client → service), CloseStreamEvent (client → service), AudioEvent (service → client), and StreamClosedEvent (service → client).
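The event flow described above can be sketched as a small simulation. This is a toy model in plain Python that calls no AWS SDK (the article lists Python as unsupported for this API). The four class names mirror the event names in the article, but their fields (`text`, `flush`, `audio_chunk`) and the mock service's behavior of emitting audio only on flush or close are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical stand-ins for the four event types named in the article.
# Field names are assumptions, not the SDK's actual shapes.
@dataclass
class TextEvent:
    text: str
    flush: bool = False  # assumed flag: synthesize the buffered text now

@dataclass
class CloseStreamEvent:
    pass

@dataclass
class AudioEvent:
    audio_chunk: bytes

@dataclass
class StreamClosedEvent:
    pass

async def mock_service(inbound, outbound):
    """Toy service: buffers text and emits fake 'audio' on flush or close."""
    buffered = []
    while True:
        event = await inbound.get()
        if isinstance(event, TextEvent):
            buffered.append(event.text)
            if event.flush:
                await outbound.put(AudioEvent(" ".join(buffered).encode()))
                buffered.clear()
        elif isinstance(event, CloseStreamEvent):
            if buffered:  # synthesize whatever is left before closing
                await outbound.put(AudioEvent(" ".join(buffered).encode()))
            await outbound.put(StreamClosedEvent())
            return

async def send_text(words, inbound):
    """Client to service: one TextEvent per word, flushing at sentence ends."""
    for word in words:
        await inbound.put(TextEvent(word, flush=word.endswith((".", "!", "?"))))
        await asyncio.sleep(0)  # yield so the receiver can run concurrently
    await inbound.put(CloseStreamEvent())

async def receive_audio(outbound):
    """Service to client: collect audio chunks until the stream closes."""
    chunks = []
    while True:
        event = await outbound.get()
        if isinstance(event, StreamClosedEvent):
            return chunks
        chunks.append(event.audio_chunk)

async def run_duplex(words):
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    service = asyncio.create_task(mock_service(inbound, outbound))
    # Sending and receiving run at the same time, as on a duplex connection.
    _, chunks = await asyncio.gather(
        send_text(words, inbound), receive_audio(outbound)
    )
    await service
    return chunks

def demo(words):
    return asyncio.run(run_duplex(words))
```

Calling `demo(["Hello", "there.", "How", "are", "you"])` returns two chunks: one produced by the flush after "there." and one produced when the stream is closed. The two queues stand in for the two directions of the single HTTP/2 connection.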
Performance Improvements
Amazon benchmarked the bidirectional API against the traditional SynthesizeSpeech API under identical test conditions: 7,045 characters of prose (970 words) with the Matthew voice, the Generative engine, and MP3 output at 24 kHz.
Simulation conditions: an LLM generating tokens at ~30 ms per word.
| Metric | Traditional API | Bidirectional | Improvement |
|---|---|---|---|
| Total processing time | 115,226 ms | 70,071 ms | 39% faster |
| API calls | 27 | 1 | 27x reduction |
| Total audio bytes | 2,354,292 | 2,324,636 | Similar |
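The figures in the "Improvement" column can be reproduced with simple arithmetic. The sketch below uses only the numbers reported in the table and the stated simulation conditions; the variable names are illustrative.

```python
# Values copied from the benchmark table and simulation conditions.
traditional_ms = 115_226
bidirectional_ms = 70_071
traditional_calls, bidirectional_calls = 27, 1
traditional_bytes, bidirectional_bytes = 2_354_292, 2_324_636
words, ms_per_word = 970, 30

# Share of total processing time saved by the bidirectional API.
time_saved_pct = (traditional_ms - bidirectional_ms) / traditional_ms * 100

# The same result expressed as a speedup factor.
speedup = traditional_ms / bidirectional_ms

# Ratio of API calls made.
call_ratio = traditional_calls / bidirectional_calls

# Relative difference in audio output size.
bytes_diff_pct = (traditional_bytes - bidirectional_bytes) / traditional_bytes * 100

# Time for the simulated LLM to emit every word at ~30 ms per word.
token_stream_ms = words * ms_per_word

print(f"time saved: {time_saved_pct:.1f}%")        # time saved: 39.2%
print(f"speedup: {speedup:.2f}x")                  # speedup: 1.64x
print(f"call ratio: {call_ratio:.0f}x")            # call ratio: 27x
print(f"audio size diff: {bytes_diff_pct:.2f}%")   # audio size diff: 1.26%
print(f"token stream: {token_stream_ms} ms")       # token stream: 29100 ms
```

Read this way, "39% faster" means 39% less total processing time, which corresponds to roughly a 1.64x speedup. The audio output differs by about 1.3%, consistent with the "Similar" label.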
In the traditional approach, the client buffers words until a sentence boundary is reached, then sends each complete sentence as a separate request and waits for the full audio response. With the bidirectional API, the client sends each word as it arrives, allowing Amazon Polly to begin synthesis immediately.
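The difference between the two sending strategies can be illustrated with a short sketch. It is plain Python with no SDK calls, the function names are hypothetical, and the sentence detection is deliberately naive (an abbreviation such as "Dr." would be treated as a sentence end), so it should be read as an illustration of the buffering idea rather than Amazon's benchmark harness.

```python
def sentence_batches(tokens):
    """Buffering workaround: yield one string per complete sentence.

    Each yielded string would become a separate request-response call.
    """
    buffer = []
    for token in tokens:
        buffer.append(token)
        if token.endswith((".", "!", "?")):  # naive sentence boundary check
            yield " ".join(buffer)
            buffer = []
    if buffer:  # trailing text without terminal punctuation
        yield " ".join(buffer)

def word_stream(tokens):
    """Streaming approach: forward each word as soon as it arrives."""
    yield from tokens

llm_tokens = "The cat sat. It purred! Then it slept".split()

requests = list(sentence_batches(llm_tokens))
events = list(word_stream(llm_tokens))

print(requests)     # ['The cat sat.', 'It purred!', 'Then it slept']
print(len(events))  # 8
```

With buffering, nothing can be sent until the first sentence is complete, and each sentence costs one round trip. With streaming, the first word is available to the service immediately and everything travels over one open connection.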
Technical Details
The bidirectional streaming API removes the need for application-level text-splitting logic and for the complex audio reassembly that multiple parallel API calls previously required.
Supported SDKs include:
- AWS SDK for Java 2.x, JavaScript v3, .NET v4
- C++, Go v2, Kotlin, PHP v3, Ruby v3, Rust, Swift
Not currently supported: Python, .NET v3, AWS CLI v1/v2, and PowerShell.
Developers can use a reactive streams Publisher to send TextEvent objects containing text, and handle incoming AudioEvent objects through a visitor pattern response handler.
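The publisher and visitor pairing can be sketched in a language-neutral way. The example below is plain Python, not SDK code: an async generator stands in for a reactive streams Publisher, and the `accept`/`visit_*` methods, the `CollectingVisitor` class, and the `fake_stream` service stub are all hypothetical names chosen for illustration.

```python
import asyncio

class AudioEvent:
    """Hypothetical service-to-client event carrying audio bytes."""
    def __init__(self, audio_chunk):
        self.audio_chunk = audio_chunk

    def accept(self, visitor):
        visitor.visit_audio(self)

class StreamClosedEvent:
    """Hypothetical service-to-client event marking the end of the stream."""
    def accept(self, visitor):
        visitor.visit_stream_closed(self)

class CollectingVisitor:
    """Response handler: one method per event type, no isinstance checks."""
    def __init__(self):
        self.audio = bytearray()
        self.closed = False

    def visit_audio(self, event):
        self.audio.extend(event.audio_chunk)

    def visit_stream_closed(self, event):
        self.closed = True

async def text_publisher(words):
    """Async generator standing in for a Publisher of text events."""
    for word in words:
        yield word
        await asyncio.sleep(0)  # simulate tokens arriving over time

async def fake_stream(publisher):
    """Service stub (assumption): emits one fake audio event per text item."""
    async for text in publisher:
        yield AudioEvent(text.encode())
    yield StreamClosedEvent()

async def consume(words):
    visitor = CollectingVisitor()
    async for event in fake_stream(text_publisher(words)):
        event.accept(visitor)  # each event dispatches to its own handler
    return visitor

def run(words):
    return asyncio.run(consume(words))
```

The visitor keeps per-event handling in one place: supporting a new event type means adding one `visit_*` method rather than extending a chain of type checks in the receive loop.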
What This Means
The bidirectional streaming API reduces end-to-end latency for conversational AI by removing the architectural bottleneck of waiting for complete text before synthesis begins. In Amazon's benchmark, the 39% reduction in total processing time and the drop from 27 API calls to 1 represent a meaningful improvement for real-time applications such as virtual assistants and interactive chatbots. The feature trades some API simplicity for measurable performance gains, although developers who previously relied on sentence-buffering workarounds will likely welcome a native solution. Availability is limited to specific SDK languages, which may slow enterprise adoption initially.