AWS SageMaker AI adds bidirectional streaming for real-time speech transcription with vLLM

TL;DR

Amazon SageMaker AI has launched bidirectional streaming support for real-time inference, enabling WebSocket-based voice applications through vLLM integration. The feature uses HTTP/2 on port 8443 to bridge client connections with vLLM's Realtime API, allowing audio to stream in while transcription streams back simultaneously over a single persistent connection.

Amazon SageMaker AI launched bidirectional streaming support for real-time inference in November 2025, according to an AWS blog post. The feature enables persistent, full-duplex connections between clients and model containers over HTTP/2, specifically targeting real-time speech-to-text applications.

Technical architecture

The implementation connects three layers:

Client to SageMaker AI: Applications connect to SageMaker AI runtime endpoints on port 8443 using HTTP/2. Each JSON message in vLLM's Realtime protocol is sent inside a RequestPayloadPart with DataType set to "UTF8", instructing SageMaker AI to forward data as WebSocket text frames.
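As a rough illustration of that wrapping step, the sketch below packs one Realtime JSON event into a dictionary shaped like a RequestPayloadPart. Only the DataType value of "UTF8" comes from the description above; the "Bytes" key, the plain-dict representation, and the helper name to_payload_part are assumptions made for illustration and are not the SDK's actual types.

```python
import json


def to_payload_part(event: dict) -> dict:
    """Wrap one Realtime JSON event in a RequestPayloadPart-shaped dict.

    Assumption: the payload bytes live under a "Bytes" key. Only
    DataType="UTF8" is taken from the article.
    """
    return {
        "Bytes": json.dumps(event).encode("utf-8"),
        # "UTF8" asks SageMaker AI to forward the data as a WebSocket text frame.
        "DataType": "UTF8",
    }


part = to_payload_part({"type": "input_audio_buffer.commit"})
print(part["DataType"], part["Bytes"].decode("utf-8"))
```

A real client would hand the equivalent structure to the SageMaker AI bidirectional streaming SDK rather than build dictionaries by hand.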

SageMaker AI to container: SageMaker AI automatically bridges between the client's HTTP/2 event stream and the WebSocket protocol. It opens a WebSocket connection to the container at ws://localhost:8080/invocations-bidirectional-stream and forwards data frames in both directions.

Container layer: A FastAPI bridge listens on port 8080 and forwards connections to vLLM's Realtime API at ws://localhost:8081/v1/realtime. The bridge handles route translation between SageMaker AI's expected path and vLLM's native endpoint.
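The relay logic inside such a bridge can be sketched without any web framework. The example below is not the AWS reference bridge: it replaces both WebSocket connections with in-memory asyncio queues (an assumption made to keep it self-contained) and shows only the core idea of pumping frames in both directions at once. The names pump, bridge, and CLOSE are hypothetical.

```python
import asyncio

CLOSE = None  # sentinel standing in for a WebSocket close frame


async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward frames from source to sink until the close sentinel arrives."""
    while True:
        frame = await source.get()
        await sink.put(frame)
        if frame is CLOSE:
            return


async def bridge(from_sagemaker, to_vllm, from_vllm, to_sagemaker) -> None:
    # Run both directions concurrently so audio flows in while text flows out.
    await asyncio.gather(
        pump(from_sagemaker, to_vllm),
        pump(from_vllm, to_sagemaker),
    )


async def demo() -> list:
    from_sagemaker, to_vllm, from_vllm, to_sagemaker = (
        asyncio.Queue() for _ in range(4)
    )
    # Script one frame in each direction, followed by a close.
    for frame in ('{"type": "input_audio_buffer.append"}', CLOSE):
        from_sagemaker.put_nowait(frame)
    for frame in ('{"type": "transcription.delta"}', CLOSE):
        from_vllm.put_nowait(frame)
    await bridge(from_sagemaker, to_vllm, from_vllm, to_sagemaker)
    received = []
    while not to_sagemaker.empty():
        received.append(to_sagemaker.get_nowait())
    return received


received = asyncio.run(demo())
print(received)
```

In an actual container the two queue pairs would be the inbound WebSocket on port 8080 and an outbound WebSocket client connected to ws://localhost:8081/v1/realtime.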

vLLM Realtime API protocol

vLLM's Realtime API requires mono PCM16 audio at a 16 kHz sample rate, base64-encoded before it is placed in a JSON message. The protocol flow:

  1. Client connects to ws://host/v1/realtime
  2. Server sends session.created event
  3. Client sends input_audio_buffer.commit when ready
  4. Client streams input_audio_buffer.append events with base64 audio chunks
  5. Server streams transcription.delta events with incremental text
  6. Server sends transcription.done with final transcription and usage statistics
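The steps above can be walked through with a small client-side handler run against a scripted list of server events. This is a simplified, sequential sketch: a real client sends audio and receives text concurrently over a live connection. The "audio", "delta", and "text" field names and the run_session helper are assumptions for illustration; only the event type names come from the list above.

```python
import json


def run_session(server_events, audio_chunks):
    """Replay the client side of the flow against scripted server events.

    Returns (messages the client would send, final transcript).
    """
    outbound, partial, final = [], [], None
    for raw in server_events:
        event = json.loads(raw)
        kind = event["type"]
        if kind == "session.created":
            # Steps 3-4: signal readiness, then stream base64 audio chunks.
            outbound.append({"type": "input_audio_buffer.commit"})
            outbound.extend(
                {"type": "input_audio_buffer.append", "audio": chunk}
                for chunk in audio_chunks
            )
        elif kind == "transcription.delta":
            partial.append(event["delta"])  # step 5: incremental text
        elif kind == "transcription.done":
            final = event["text"]  # step 6: final transcription
            break
    transcript = final if final is not None else "".join(partial)
    return outbound, transcript


events = [
    json.dumps({"type": "session.created"}),
    json.dumps({"type": "transcription.delta", "delta": "hello"}),
    json.dumps({"type": "transcription.delta", "delta": " world"}),
    json.dumps({"type": "transcription.done", "text": "hello world"}),
]
sent, transcript = run_session(events, ["AAAA", "BBBB"])
print([m["type"] for m in sent], transcript)
```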

The model begins transcribing as soon as it has sufficient audio context, streaming tokens back while the client continues sending audio chunks.
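To make the audio format concrete, here is a minimal sketch of preparing chunks for streaming: float samples are converted to 16-bit PCM, split into fixed-duration chunks, base64-encoded, and wrapped in input_audio_buffer.append messages. The little-endian byte order, the 100 ms chunk size, the "audio" field name, and both helper names are assumptions; the 16 kHz mono PCM16 format and the event type are from the article.

```python
import base64
import json
import math
import struct

SAMPLE_RATE = 16_000  # Hz, mono


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))


def append_events(pcm: bytes, chunk_ms: int = 100):
    """Yield one input_audio_buffer.append JSON message per audio chunk."""
    chunk_bytes = SAMPLE_RATE * 2 * chunk_ms // 1000  # 2 bytes per sample
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),  # field name assumed
        })


# 250 ms of a 440 Hz tone as stand-in microphone audio.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 4)]
messages = list(append_events(float_to_pcm16(tone)))
print(len(messages))
```

Smaller chunks give the model audio sooner at the cost of more messages; the chunk duration used here is arbitrary.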

Reference implementation

AWS provides a reference implementation using Mistral AI's Voxtral-Mini-4B-Realtime-2602 model. The example includes:

  • Custom Docker container built on SageMaker AI vLLM Deep Learning Container
  • Python client using SageMaker AI bidirectional streaming SDK
  • Gradio-based live microphone demo
  • Full code available in an AWS GitHub repository

vLLM applies piecewise CUDA graph execution to reduce GPU kernel launch overhead, directly reducing per-token latency during streaming transcription.

Infrastructure requirements

SageMaker AI handles connection management with WebSocket ping/pong keepalive frames, container health checks, and CloudWatch monitoring. The service eliminates the need for custom protocol translation layers or GPU server management.

What this means

This release removes a significant infrastructure barrier to deploying production voice AI applications on AWS. With automatic HTTP/2-to-WebSocket bridging and vLLM integration, teams no longer need to build their own streaming infrastructure. For enterprises already using SageMaker AI, this provides a direct path to real-time voice capabilities (voice agents, live captioning, contact center analytics) without migrating to specialized speech platforms. Because vLLM is open source, the serving layer stays portable, which reduces vendor lock-in while AWS handles operational complexity.
