AWS SageMaker AI adds bidirectional streaming for real-time speech transcription with vLLM

TL;DR

Amazon SageMaker AI has launched bidirectional streaming support for real-time inference, enabling WebSocket-based voice applications through vLLM integration. The feature uses HTTP/2 on port 8443 to bridge client connections with vLLM's Realtime API, allowing audio to stream in while transcription streams back simultaneously over a single persistent connection.

May 20, 2026 · 5:20 PM · 2 min read

Amazon SageMaker AI launched bidirectional streaming support for real-time inference in November 2025, according to an AWS blog post. The feature enables persistent, full-duplex connections between clients and model containers over HTTP/2, specifically targeting real-time speech-to-text applications.

Technical architecture

The implementation connects three layers:

Client to SageMaker AI: Applications connect to SageMaker AI runtime endpoints on port 8443 using HTTP/2. Each JSON message in vLLM's Realtime protocol is sent inside a RequestPayloadPart with DataType set to "UTF8", instructing SageMaker AI to forward data as WebSocket text frames.

SageMaker AI to container: SageMaker AI automatically bridges HTTP/2 event streams and WebSocket protocols. It establishes a WebSocket connection to containers at ws://localhost:8080/invocations-bidirectional-stream and forwards data frames bidirectionally.

Container layer: A FastAPI bridge listens on port 8080 and forwards connections to vLLM's Realtime API at ws://localhost:8081/v1/realtime. The bridge handles route translation between SageMaker AI's expected path and vLLM's native endpoint.
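
As a rough illustration of the client-to-SageMaker AI layer described above, the sketch below wraps one Realtime JSON event as a UTF-8 payload part. The dictionary layout and the helper name wrap_realtime_event are assumptions made for illustration; the real SDK exposes its own request classes, which are not reproduced here. Only the DataType value "UTF8" comes from the description above.

```python
import json


def wrap_realtime_event(event: dict) -> dict:
    """Wrap one Realtime JSON event as a text payload part.

    The returned dict is a hypothetical stand-in for a RequestPayloadPart;
    it is not the SDK's actual request type.
    """
    # Serialize compactly and encode as UTF-8 so the payload is valid text.
    body = json.dumps(event, separators=(",", ":")).encode("utf-8")
    # DataType "UTF8" marks the bytes as text, to be forwarded as a
    # WebSocket text frame rather than a binary frame.
    return {"PayloadPart": {"Bytes": body, "DataType": "UTF8"}}


part = wrap_realtime_event({"type": "input_audio_buffer.commit"})
print(part["PayloadPart"]["DataType"])
print(part["PayloadPart"]["Bytes"].decode("utf-8"))
```

Keeping each protocol message in its own payload part means one JSON event maps to one text frame on the container side.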

vLLM Realtime API protocol

vLLM's Realtime API requires audio encoded as base64 PCM16 at 16 kHz sample rate, mono channel. The protocol flow:

1. Client connects to ws://host/v1/realtime
2. Server sends session.created event
3. Client sends input_audio_buffer.commit when ready
4. Client streams input_audio_buffer.append events with base64 audio chunks
5. Server streams transcription.delta events with incremental text
6. Server sends transcription.done with final transcription and usage statistics
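
The audio format and the append step above can be sketched as follows. This is a minimal example under stated assumptions: the "audio" field name, the 100 ms chunk size, and the helper names sine_pcm16 and append_events are illustrative choices, not details confirmed by the source. A synthetic sine tone stands in for microphone input.

```python
import base64
import json
import math
import struct

SAMPLE_RATE_HZ = 16_000   # 16 kHz, as the protocol requires
BYTES_PER_SAMPLE = 2      # PCM16 = 16-bit signed samples, mono


def sine_pcm16(duration_s: float, freq_hz: float = 440.0) -> bytes:
    """Generate a mono PCM16 little-endian test tone (stand-in for a mic)."""
    n = int(SAMPLE_RATE_HZ * duration_s)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)


def append_events(pcm: bytes, chunk_ms: int = 100):
    """Yield one JSON append event per fixed-duration audio chunk."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            # Field name "audio" is an assumption for this sketch.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


pcm = sine_pcm16(0.5)
events = list(append_events(pcm))
print(len(events), "events")
```

At 16 kHz and 2 bytes per sample, raw audio runs at 32,000 bytes per second, so a 100 ms chunk is 3,200 bytes before base64 encoding and 4,268 characters after it.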

The model begins transcribing as soon as it has sufficient audio context, streaming tokens back while the client continues sending audio chunks.
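
On the receiving side, a client can accumulate incremental text as it arrives. The sketch below is illustrative only: the "delta", "text", and "usage" field names and the sample messages are assumptions, and a plain list of JSON strings stands in for the live connection.

```python
import json


def collect_transcript(messages):
    """Fold a stream of server events into (final_text, usage).

    Field names ("delta", "text", "usage") are assumed for this sketch.
    """
    partial = []
    for raw in messages:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "transcription.delta":
            # Incremental text: a live UI would render this immediately.
            partial.append(event.get("delta", ""))
        elif kind == "transcription.done":
            # Prefer the server's final text; fall back to joined deltas.
            return event.get("text", "".join(partial)), event.get("usage")
        # Other event types (for example session.created) are ignored here.
    return "".join(partial), None


# Hypothetical messages, in the order the protocol flow describes.
sample = [
    json.dumps({"type": "session.created"}),
    json.dumps({"type": "transcription.delta", "delta": "Hello"}),
    json.dumps({"type": "transcription.delta", "delta": " world"}),
    json.dumps({"type": "transcription.done", "text": "Hello world",
                "usage": {"total_tokens": 2}}),
]
text, usage = collect_transcript(sample)
print(text, usage)
```

In a real client this loop would run concurrently with the sender, since deltas arrive while audio is still being streamed.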

Reference implementation

AWS provides a reference implementation using Mistral AI's Voxtral-Mini-4B-Realtime-2602 model. The example includes:

Custom Docker container built on SageMaker AI vLLM Deep Learning Container
Python client using SageMaker AI bidirectional streaming SDK
Gradio-based live microphone demo
Full code available in an AWS GitHub repository

vLLM applies piecewise CUDA graph execution to cut GPU kernel launch overhead, which lowers per-token latency during streaming transcription.

Infrastructure requirements

SageMaker AI handles connection management with WebSocket ping/pong keepalive frames, container health checks, and CloudWatch monitoring. The service eliminates the need for custom protocol translation layers or GPU server management.

What this means

This release removes a significant infrastructure barrier to deploying production voice AI applications on AWS. With automatic HTTP/2-to-WebSocket bridging and vLLM integration, teams no longer need to build custom streaming infrastructure. For enterprises already using SageMaker AI, this provides a direct path to real-time voice capabilities (voice agents, live captioning, contact center analytics) without migrating to specialized speech platforms. Because the serving layer is open-source vLLM, it reduces vendor lock-in while AWS handles operational complexity.

Source: aws.amazon.com
