product update · Amazon Web Services

Amazon Polly adds bidirectional streaming API for real-time speech synthesis in conversational AI

TL;DR

Amazon has released a new Bidirectional Streaming API for Amazon Polly that enables simultaneous text input and audio output over a single HTTP/2 connection. The API reduces end-to-end latency by 39% compared to traditional request-response TTS by allowing text to be sent word-by-word as LLMs generate tokens, rather than waiting for complete sentences. The feature is available in Java, JavaScript, .NET, C++, Go, Kotlin, PHP, Ruby, Rust, and Swift SDKs.


Amazon Polly Adds Real-Time Bidirectional Streaming for Conversational AI

Amazon has released a new Bidirectional Streaming API for Amazon Polly that enables real-time text-to-speech synthesis where text and audio flow simultaneously over a single connection.

The Problem with Traditional TTS

Conventional text-to-speech APIs operate in request-response mode: developers must collect the complete text before making a synthesis request. For conversational AI applications powered by large language models (LLMs)—which generate text token-by-token—this creates a bottleneck. Users must wait for:

  1. The LLM to finish generating the complete response
  2. The TTS service to synthesize the entire text
  3. Audio to download before playback begins

Amazon Polly previously supported streaming audio output, but required complete input text upfront.
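A toy latency model illustrates why overlapping generation with synthesis helps. All numbers here are hypothetical illustrations, not Polly measurements:

```python
# Toy model: why overlapping LLM generation with TTS reduces latency.
# All figures are hypothetical illustration, not measured Polly numbers.

WORDS = 100           # length of the LLM response in words
LLM_MS_PER_WORD = 30  # token generation speed
TTS_MS_PER_WORD = 20  # hypothetical synthesis speed per word

def sequential_latency(words):
    """Request-response TTS: wait for the full text, then synthesize it all."""
    return words * LLM_MS_PER_WORD + words * TTS_MS_PER_WORD

def streaming_latency(words):
    """Bidirectional streaming: synthesis runs concurrently with generation.
    Total time is dominated by the slower stage, plus one word's trip
    through the faster stage."""
    return (words * max(LLM_MS_PER_WORD, TTS_MS_PER_WORD)
            + min(LLM_MS_PER_WORD, TTS_MS_PER_WORD))

print(sequential_latency(WORDS))  # → 5000
print(streaming_latency(WORDS))   # → 3020
```

Under these made-up rates, pipelining cuts end-to-end time from 5.0 s to about 3.0 s; the real savings depend on generation and synthesis speeds.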

How Bidirectional Streaming Works

The new StartSpeechSynthesisStream API introduces true duplex communication:

  • Send text incrementally: Stream text to Amazon Polly as it becomes available, word-by-word
  • Receive audio immediately: Get synthesized audio bytes back in real-time as they're generated
  • Control timing: Use flush configuration to trigger synthesis of buffered text
  • Single connection: HTTP/2 enables simultaneous bidirectional flow

Key components include TextEvent (client → service), CloseStreamEvent (client → service), AudioEvent (service → client), and StreamClosedEvent (service → client).
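The event flow can be modeled in miniature. This Python sketch mirrors the protocol only; the class internals and the stand-in service are hypothetical, and the Python SDK does not currently support this API:

```python
from dataclasses import dataclass

# In-memory model of the duplex event protocol. Event names follow the
# article; everything else is an illustrative stand-in, not SDK code.

@dataclass
class TextEvent:          # client -> service: a chunk of input text
    text: str

class CloseStreamEvent:   # client -> service: no more text is coming
    pass

@dataclass
class AudioEvent:         # service -> client: synthesized audio bytes
    audio_chunk: bytes

class StreamClosedEvent:  # service -> client: synthesis is complete
    pass

def fake_service(events):
    """Yield audio/close events as text events arrive (stand-in for Polly)."""
    for event in events:
        if isinstance(event, TextEvent):
            # Real synthesis replaced by a placeholder transformation.
            yield AudioEvent(audio_chunk=event.text.encode())
        elif isinstance(event, CloseStreamEvent):
            yield StreamClosedEvent()
            return

out = list(fake_service([TextEvent("Hello "), TextEvent("world"),
                         CloseStreamEvent()]))
```

The key property the model captures: audio events interleave with text events on the same stream, rather than arriving only after all input is sent.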

Performance Improvements

Amazon benchmarked the bidirectional API against the traditional SynthesizeSpeech API using identical test conditions: 7,045 characters of prose (970 words) with the Matthew voice, Generative engine, MP3 output at 24 kHz.

Simulation conditions: LLM generating tokens at ~30ms per word.

  Metric                  Traditional API   Bidirectional   Improvement
  Total processing time   115,226 ms        70,071 ms       39% faster
  API calls               27                1               27x reduction
  Total audio bytes       2,354,292         2,324,636       Similar
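The improvement figures follow directly from the raw numbers in the table:

```python
# Recompute the benchmark's improvement figures from its raw numbers.
traditional_ms, bidirectional_ms = 115_226, 70_071
speedup_pct = (traditional_ms - bidirectional_ms) / traditional_ms * 100
print(round(speedup_pct))  # → 39 (% faster)

traditional_calls, bidirectional_calls = 27, 1
print(traditional_calls // bidirectional_calls)  # → 27 (x reduction)
```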

The traditional API buffers words until sentence boundaries are reached, then sends complete sentences as separate requests and waits for full audio responses. The bidirectional API sends each word as it arrives, allowing Amazon Polly to begin synthesis immediately.
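The two strategies above can be sketched side by side. These are hypothetical helper functions illustrating the buffering logic, not AWS SDK code:

```python
import re

def sentence_requests(word_stream):
    """Traditional approach: buffer words until a sentence boundary,
    then emit one synthesis request per complete sentence."""
    buffer = []
    for word in word_stream:
        buffer.append(word)
        if re.search(r"[.!?]$", word):  # crude sentence-boundary check
            yield " ".join(buffer)
            buffer = []
    if buffer:                          # flush any trailing fragment
        yield " ".join(buffer)

def streaming_requests(word_stream):
    """Bidirectional approach: forward every word as soon as it arrives."""
    yield from word_stream

words = ["Hi", "there.", "How", "are", "you?"]
print(list(sentence_requests(words)))        # → ['Hi there.', 'How are you?']
print(len(list(streaming_requests(words))))  # → 5 sends, one connection
```

With sentence buffering, synthesis of "How are you?" cannot start until the question mark arrives; word-by-word forwarding lets it start at "How".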

Technical Details

The bidirectional streaming API eliminates the need for application-level text separation logic and complex audio reassembly that previously required multiple parallel API calls.

Supported SDKs include:

  • AWS SDK for Java 2.x, JavaScript v3, .NET v4
  • C++, Go v2, Kotlin, PHP v3, Ruby v3, Rust, Swift

Not currently supported: Python, .NET v3, AWS CLI v1/v2, and PowerShell.

Developers can use a reactive streams Publisher to send TextEvent objects containing text, and handle incoming AudioEvent objects through a visitor pattern response handler.
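A visitor-style handler dispatches on the type of each incoming event. This minimal Python model shows the dispatch pattern only; the class and method names are illustrative, not the actual SDK interface:

```python
# Model of a visitor-pattern response handler for the service -> client
# event stream. Names are illustrative, not the actual AWS SDK API.

class AudioEvent(dict): pass
class StreamClosedEvent(dict): pass

class ResponseHandler:
    def __init__(self):
        self.audio = bytearray()
        self.closed = False

    def visit(self, event):
        # Dispatch on event type, mirroring per-event-type callbacks.
        handler = getattr(self, f"on_{type(event).__name__}", self.on_unknown)
        handler(event)

    def on_AudioEvent(self, event):
        self.audio.extend(event["chunk"])   # accumulate playable audio

    def on_StreamClosedEvent(self, event):
        self.closed = True                  # synthesis finished

    def on_unknown(self, event):
        pass                                # ignore unhandled event types

h = ResponseHandler()
for ev in [AudioEvent(chunk=b"\x01\x02"), AudioEvent(chunk=b"\x03"),
           StreamClosedEvent()]:
    h.visit(ev)
print(bytes(h.audio), h.closed)  # → b'\x01\x02\x03' True
```

The appeal of this pattern is that audio can be handed to a player chunk by chunk, while the close event gives a clean signal that playback can drain and stop.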

What This Means

The bidirectional streaming API significantly reduces end-to-end latency for conversational AI by eliminating the architectural bottleneck of waiting for complete text before synthesis begins. The 39% latency reduction and 27x drop in API calls represent a meaningful improvement for real-time applications like virtual assistants and interactive chatbots. The streaming API surface is more complex than a single request-response call, but developers who previously built sentence-buffering workarounds get a native solution in exchange for that complexity. Availability is limited to specific SDK languages, which may slow enterprise adoption initially.

