AWS SageMaker AI adds bidirectional streaming for real-time speech transcription with vLLM
Amazon SageMaker AI has launched bidirectional streaming support for real-time inference, enabling WebSocket-based voice applications through vLLM integration. The feature uses HTTP/2 on port 8443 to bridge client connections with vLLM's Realtime API, allowing audio to stream in while transcription streams back simultaneously over a single persistent connection.
Amazon SageMaker AI launched bidirectional streaming support for real-time inference in November 2025, according to an AWS blog post. The feature enables persistent, full-duplex connections between clients and model containers over HTTP/2, specifically targeting real-time speech-to-text applications.
Technical architecture
The implementation connects three layers:
Client to SageMaker AI: Applications connect to SageMaker AI runtime endpoints on port 8443 using HTTP/2. Each JSON message in vLLM's Realtime protocol is sent inside a RequestPayloadPart with DataType set to "UTF8", instructing SageMaker AI to forward data as WebSocket text frames.
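As a rough illustration of that wrapping step, the sketch below serializes one Realtime event to JSON and tags it as UTF-8 text. The `PayloadPart` dataclass and `wrap_event` helper are hypothetical stand-ins written for this example, not the SageMaker AI SDK's own types; only the "UTF8" data type and the event name come from the description above.

```python
import json
from dataclasses import dataclass


@dataclass
class PayloadPart:
    """Hypothetical stand-in for a SageMaker AI RequestPayloadPart."""
    data: bytes      # serialized JSON event
    data_type: str   # "UTF8" asks SageMaker AI to forward a WebSocket text frame


def wrap_event(event: dict) -> PayloadPart:
    """Serialize one Realtime JSON event and tag it as UTF-8 text."""
    body = json.dumps(event, separators=(",", ":")).encode("utf-8")
    return PayloadPart(data=body, data_type="UTF8")


part = wrap_event({"type": "input_audio_buffer.commit"})
print(part.data_type, part.data)
```

Because every Realtime message is JSON, a single helper like this can wrap both control events and audio events, with the audio travelling as base64 text inside the JSON rather than as raw binary frames.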
SageMaker AI to container: SageMaker AI automatically bridges HTTP/2 event streams and WebSocket protocols. It establishes a WebSocket connection to containers at ws://localhost:8080/invocations-bidirectional-stream and forwards data frames bidirectionally.
Container layer: A FastAPI bridge listens on port 8080 and forwards connections to vLLM's Realtime API at ws://localhost:8081/v1/realtime. The bridge handles route translation between SageMaker AI's expected path and vLLM's native endpoint.
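The forwarding logic of such a bridge can be sketched without any web framework. The example below is not the AWS reference code: it uses in-memory `asyncio.Queue` objects as stand-ins for the two WebSocket connections, and the `pump`, `bridge`, and `upstream_url` names are hypothetical. Only the path and URL constants come from the description above.

```python
import asyncio

SAGEMAKER_PATH = "/invocations-bidirectional-stream"
VLLM_REALTIME_URL = "ws://localhost:8081/v1/realtime"


def upstream_url(path: str) -> str:
    """Translate the path SageMaker AI calls into vLLM's native endpoint."""
    if path != SAGEMAKER_PATH:
        raise ValueError(f"unexpected path: {path}")
    return VLLM_REALTIME_URL


async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Copy frames from source to sink; a None frame signals close."""
    while True:
        frame = await source.get()
        await sink.put(frame)
        if frame is None:
            return


async def bridge(client_in, upstream_out, upstream_in, client_out) -> None:
    """Run both directions concurrently so neither side blocks the other."""
    await asyncio.gather(pump(client_in, upstream_out),
                         pump(upstream_in, client_out))


def drain(queue: asyncio.Queue) -> list:
    """Collect everything currently sitting in a queue."""
    items = []
    while not queue.empty():
        items.append(queue.get_nowait())
    return items


async def demo():
    client_in, upstream_out = asyncio.Queue(), asyncio.Queue()
    upstream_in, client_out = asyncio.Queue(), asyncio.Queue()
    for frame in ('{"type":"input_audio_buffer.commit"}', None):
        client_in.put_nowait(frame)
    for frame in ('{"type":"session.created"}', None):
        upstream_in.put_nowait(frame)
    await bridge(client_in, upstream_out, upstream_in, client_out)
    return drain(upstream_out), drain(client_out)


forwarded_up, forwarded_down = asyncio.run(demo())
print(upstream_url(SAGEMAKER_PATH), forwarded_up, forwarded_down)
```

In a real container the queues would be replaced by the incoming WebSocket from SageMaker AI and an outgoing WebSocket client connection to vLLM. The key design point is the same: two independent copy loops running concurrently, which is what makes the connection full-duplex.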
vLLM Realtime API protocol
vLLM's Realtime API requires audio encoded as base64 PCM16 at 16 kHz sample rate, mono channel. The protocol flow:
- Client connects to ws://host/v1/realtime
- Server sends session.created event
- Client sends input_audio_buffer.commit when ready
- Client streams input_audio_buffer.append events with base64 audio chunks
- Server streams transcription.delta events with incremental text
- Server sends transcription.done with final transcription and usage statistics
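The message flow listed above can be sketched as two small helpers: one that yields the client's events in order, and one that folds the server's events into a transcript. The event type names come from the protocol description. The JSON field names ("audio", "delta", "text") and the helper names are assumptions made for this illustration and should be checked against the vLLM documentation.

```python
import json


def client_events(audio_chunks):
    """Yield client events in protocol order: commit first, then appends."""
    yield {"type": "input_audio_buffer.commit"}
    for chunk in audio_chunks:
        # "audio" is an assumed field name for the base64 payload.
        yield {"type": "input_audio_buffer.append", "audio": chunk}


def collect_transcript(server_frames):
    """Accumulate delta text until the done event arrives."""
    partial = []
    for frame in server_frames:
        event = json.loads(frame)
        kind = event.get("type")
        if kind == "transcription.delta":
            partial.append(event.get("delta", ""))   # assumed field name
        elif kind == "transcription.done":
            # Prefer the server's final text; fall back to joined deltas.
            return event.get("text", "".join(partial))
        # Other events, such as session.created, are ignored here.
    return "".join(partial)


outgoing = list(client_events(["AAAA", "BBBB"]))
incoming = [
    json.dumps({"type": "session.created"}),
    json.dumps({"type": "transcription.delta", "delta": "hello "}),
    json.dumps({"type": "transcription.delta", "delta": "world"}),
    json.dumps({"type": "transcription.done", "text": "hello world"}),
]
print(outgoing[0]["type"], collect_transcript(incoming))
```

In a live client the two helpers would run concurrently, one feeding the send side of the connection and the other reading from the receive side.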
The model begins transcribing as soon as it has sufficient audio context, streaming tokens back while the client continues sending audio chunks.
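Each chunk the client sends has to match the audio format described above: 16-bit PCM, 16 kHz, mono, base64-encoded. A minimal sketch of that conversion from floating-point samples follows. The helper name, the little-endian byte order, and the 100 ms chunk length are assumptions chosen for illustration.

```python
import base64
import math
import struct

SAMPLE_RATE = 16_000  # Hz, mono


def float_to_pcm16_b64(samples):
    """Convert floats in [-1.0, 1.0] to base64-encoded 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    # "<h" is a little-endian signed 16-bit integer (assumed byte order).
    raw = struct.pack(f"<{len(ints)}h", *ints)
    return base64.b64encode(raw).decode("ascii")


# 100 ms of a 440 Hz test tone: 1,600 samples, 3,200 bytes of PCM16.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 10)]
chunk = float_to_pcm16_b64(tone)
print(len(tone), len(base64.b64decode(chunk)), len(chunk))
```

At this format, one second of audio is 32,000 bytes before encoding, and base64 adds roughly a third on top, which is worth factoring in when choosing a chunk size.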
Reference implementation
AWS provides a reference implementation using Mistral AI's Voxtral-Mini-4B-Realtime-2602 model. The example includes:
- Custom Docker container built on SageMaker AI vLLM Deep Learning Container
- Python client using SageMaker AI bidirectional streaming SDK
- Gradio-based live microphone demo
- Full code available in an AWS GitHub repository
vLLM applies piecewise CUDA graph execution to reduce GPU kernel launch overhead, directly reducing per-token latency during streaming transcription.
Infrastructure requirements
SageMaker AI handles connection management with WebSocket ping/pong keepalive frames, container health checks, and CloudWatch monitoring. The service eliminates the need for custom protocol translation layers or GPU server management.
What this means
This release removes a significant infrastructure barrier for deploying production voice AI applications on AWS. With automatic HTTP/2-to-WebSocket bridging and native vLLM integration, teams no longer need to build custom streaming infrastructure in front of the model. For enterprises already using SageMaker AI, this provides a direct path to add real-time voice capabilities, such as voice agents, live captioning, and contact center analytics, without migrating to specialized speech platforms. The open-source vLLM foundation reduces vendor lock-in on the serving layer while AWS handles operational complexity.