Amazon SageMaker AI adds bidirectional streaming for real-time speech transcription with vLLM

TL;DR

Amazon SageMaker AI has launched bidirectional streaming support for real-time inference, enabling WebSocket-based voice applications through vLLM integration. The feature uses HTTP/2 on port 8443 to bridge client connections with vLLM's Realtime API, allowing audio to stream in while transcription streams back simultaneously over a single persistent connection.

Amazon SageMaker AI launched bidirectional streaming support for real-time inference in November 2025, according to an AWS blog post. The feature enables persistent, full-duplex connections between clients and model containers over HTTP/2, specifically targeting real-time speech-to-text applications.

Technical architecture

The implementation connects three layers:

Client to SageMaker AI: Applications connect to SageMaker AI runtime endpoints on port 8443 using HTTP/2. Each JSON message in vLLM's Realtime protocol is sent inside a RequestPayloadPart with DataType set to "UTF8", instructing SageMaker AI to forward data as WebSocket text frames.
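
As a rough illustration of the wrapping described above, the sketch below serializes one Realtime JSON message into a payload part. It uses a plain dictionary rather than the SageMaker AI SDK's own types. The helper name wrap_realtime_message and the "Bytes" key are assumptions made for illustration; only the DataType value "UTF8" comes from the description above.

```python
import json
from typing import Any, Dict


def wrap_realtime_message(message: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap one Realtime JSON message as a payload part (hypothetical shape)."""
    # The JSON text is UTF-8 encoded so it can be forwarded as a text frame.
    body = json.dumps(message).encode("utf-8")
    return {
        "Bytes": body,       # assumed field name for the raw payload
        "DataType": "UTF8",  # tells SageMaker AI to forward a WebSocket text frame
    }


part = wrap_realtime_message({"type": "input_audio_buffer.commit"})
```

A real client would hand the same bytes and data type to the request-part type provided by the bidirectional streaming SDK instead of building a dictionary.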

SageMaker AI to container: SageMaker AI automatically bridges HTTP/2 event streams and WebSocket protocols. It establishes a WebSocket connection to containers at ws://localhost:8080/invocations-bidirectional-stream and forwards data frames bidirectionally.

Container layer: A FastAPI bridge listens on port 8080 and forwards connections to vLLM's Realtime API at ws://localhost:8081/v1/realtime. The bridge handles route translation between SageMaker AI's expected path and vLLM's native endpoint.
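
The core of such a bridge is a pair of concurrent forwarding loops, one per direction. The sketch below shows that pattern with plain asyncio and duck-typed receive/send callables, so it does not depend on FastAPI or any WebSocket library. The names pump and bridge, and the convention that a receive callable returns None when its side closes, are assumptions for illustration and are not taken from the AWS reference code.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

Recv = Callable[[], Awaitable[Optional[str]]]  # returns None when the peer closes
Send = Callable[[str], Awaitable[None]]


async def pump(recv: Recv, send: Send) -> int:
    """Forward text frames one way until recv() returns None; return the count."""
    forwarded = 0
    while True:
        frame = await recv()
        if frame is None:
            break
        await send(frame)
        forwarded += 1
    return forwarded


async def bridge(client_recv: Recv, client_send: Send,
                 upstream_recv: Recv, upstream_send: Send) -> List[int]:
    """Run both directions concurrently: audio flows up while text flows down."""
    return await asyncio.gather(
        pump(client_recv, upstream_send),   # client -> vLLM
        pump(upstream_recv, client_send),   # vLLM -> client
    )
```

In a FastAPI-based bridge, the four callables would typically be thin adapters around the accepted WebSocket on /invocations-bidirectional-stream and a client connection to ws://localhost:8081/v1/realtime, with each adapter translating a disconnect into None.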

vLLM Realtime API protocol

vLLM's Realtime API requires mono PCM16 audio sampled at 16 kHz, base64-encoded before it is sent. The protocol flow:

  1. Client connects to ws://host/v1/realtime
  2. Server sends session.created event
  3. Client sends input_audio_buffer.commit when ready
  4. Client streams input_audio_buffer.append events with base64 audio chunks
  5. Server streams transcription.delta events with incremental text
  6. Server sends transcription.done with final transcription and usage statistics
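
The client side of the flow above can be sketched as follows: convert samples to PCM16, split them into chunks, and emit a commit event followed by append events. This is a minimal sketch under stated assumptions. The "audio" field name, little-endian byte order, and the 100 ms chunk size are assumptions, and the helper names are hypothetical.

```python
import base64
import json
import struct
from typing import Iterable, Iterator, List

SAMPLE_RATE_HZ = 16_000  # 16 kHz mono, as the protocol requires
BYTES_PER_SAMPLE = 2     # PCM16


def floats_to_pcm16(samples: Iterable[float]) -> bytes:
    """Convert samples in [-1.0, 1.0] to signed 16-bit PCM (little-endian assumed)."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_bytes(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration slices; 100 ms at 16 kHz PCM16 is 3200 bytes."""
    size = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def client_events(pcm: bytes, chunk_ms: int = 100) -> List[str]:
    """Build the JSON events for steps 3 and 4 of the flow, in that order."""
    events = [json.dumps({"type": "input_audio_buffer.commit"})]
    for chunk in chunk_bytes(pcm, chunk_ms):
        audio = base64.b64encode(chunk).decode("ascii")
        events.append(json.dumps(
            {"type": "input_audio_buffer.append", "audio": audio}  # "audio" key assumed
        ))
    return events
```

Smaller chunks lower the delay before the first partial transcript at the cost of more messages; the right size depends on the application.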

The model begins transcribing as soon as it has sufficient audio context, streaming tokens back while the client continues sending audio chunks.
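
On the receiving side, a client can fold incremental events into a running transcript while it keeps sending audio. The handler below is a small sketch of that idea. The event types come from the flow above, but the "delta" and "text" field names and the function name apply_server_event are assumptions for illustration.

```python
import json
from typing import List, Optional


def apply_server_event(raw: str, partial: List[str]) -> Optional[str]:
    """Update the running transcript; return the final text once it arrives."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "transcription.delta":
        # Incremental text: append it so a UI can show "".join(partial) live.
        partial.append(event.get("delta", ""))
        return None
    if kind == "transcription.done":
        # Prefer the server's final text; fall back to the accumulated deltas.
        return event.get("text", "".join(partial))
    # session.created and any other event types are ignored in this sketch.
    return None
```

Because the handler only touches the list it is given, it can run in a receive loop that is independent of the loop sending audio chunks.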

Reference implementation

AWS provides a reference implementation using Mistral AI's Voxtral-Mini-4B-Realtime-2602 model. The example includes:

  • Custom Docker container built on SageMaker AI vLLM Deep Learning Container
  • Python client using SageMaker AI bidirectional streaming SDK
  • Gradio-based live microphone demo
  • Full code available in an AWS GitHub repository

vLLM applies piecewise CUDA graph execution to cut GPU kernel launch overhead, which lowers per-token latency during streaming transcription.

Infrastructure requirements

SageMaker AI handles connection management with WebSocket ping/pong keepalive frames, container health checks, and CloudWatch monitoring. Teams therefore do not need to build their own HTTP/2-to-WebSocket translation layer or manage GPU servers themselves.

What this means

This release removes a significant infrastructure barrier to deploying production voice AI applications on AWS. With automatic HTTP/2-to-WebSocket bridging and vLLM running inside the container, teams no longer have to build custom streaming infrastructure. For enterprises already using SageMaker AI, this provides a direct path to real-time voice capabilities (voice agents, live captioning, contact center analytics) without migrating to specialized speech platforms. Because the serving layer is built on open-source vLLM, it reduces vendor lock-in at that layer while AWS handles the operational complexity.
