AWS launches real-time voice agent framework combining Stream Vision Agents with Nova 2 Sonic

TL;DR

AWS has launched a real-time voice agent framework that combines Stream's Vision Agents, an open-source Python framework for building real-time voice AI agents, with Amazon Nova 2 Sonic through Amazon Bedrock. The system targets end-to-end latency under 500 milliseconds and relies on Stream's global edge network, which offers audio latency under 30 ms and join times that are typically under 500 ms.

AWS has launched a real-time voice agent framework built on Stream's Vision Agents, an open-source Python framework for building production-ready voice AI agents, integrated with Amazon Nova 2 Sonic through Amazon Bedrock. The system targets end-to-end latency under 500 milliseconds for real-time voice interactions.

Technical architecture

The framework combines three components: Stream's Vision Agents handles orchestration and provides client SDKs for React, iOS, Android, Flutter, and React Native; Amazon Nova 2 Sonic provides speech-to-speech processing through Bedrock's real-time API; and Stream's Edge Network delivers the real-time transport layer with typically sub-500ms join times and under 30ms audio latency.

The system separates concerns by keeping Amazon Nova 2 Sonic running in the customer's AWS account while Stream's infrastructure handles real-time media transport. Audio flows through Stream's globally distributed SFU (Selective Forwarding Unit) nodes, which terminate WebRTC connections and forward audio tracks to Vision Agents worker processes.

Audio pipeline details

Incoming speech is decoded to raw PCM by Vision Agent workers and streamed to Nova 2 Sonic via Bedrock's real-time API. Response audio frames are re-encoded, packetized as RTP, and delivered back through the SFU to client devices. The framework uses voice activity detection (VAD) in the worker to detect speech boundaries and barge-in events, while browser-based echo cancellation prevents the agent's output from retriggering the VAD loop.
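
To make the VAD and barge-in step concrete, here is a minimal sketch of an energy-based detector running over raw PCM frames. It is an illustration only, not the detector Vision Agents ships: the class name `EnergyVad`, the 16 kHz sample rate, the 20 ms frame size, and the threshold and frame counts are all assumptions chosen for the example.

```python
import math
import struct
from dataclasses import dataclass, field
from typing import List, Optional

SAMPLE_RATE = 16_000                                 # assumed sample rate in Hz
FRAME_MS = 20                                        # assumed frame duration
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per frame


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian 16-bit mono PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


@dataclass
class EnergyVad:
    """Hypothetical energy-threshold VAD with simple hysteresis."""
    threshold: float = 500.0   # RMS level treated as speech (assumed value)
    start_frames: int = 3      # consecutive loud frames needed to open a segment
    end_frames: int = 10       # consecutive quiet frames needed to close it
    speaking: bool = False
    _loud: int = field(default=0, init=False, repr=False)
    _quiet: int = field(default=0, init=False, repr=False)

    def push(self, frame: bytes, agent_is_talking: bool = False) -> Optional[str]:
        """Feed one frame; return 'speech_start', 'barge_in', 'speech_end', or None."""
        if rms(frame) >= self.threshold:
            self._loud += 1
            self._quiet = 0
        else:
            self._quiet += 1
            self._loud = 0
        if not self.speaking and self._loud >= self.start_frames:
            self.speaking = True
            # Speech that starts while the agent is still talking is a barge-in.
            return "barge_in" if agent_is_talking else "speech_start"
        if self.speaking and self._quiet >= self.end_frames:
            self.speaking = False
            return "speech_end"
        return None


def tone(amplitude: int, n: int = SAMPLES_PER_FRAME) -> bytes:
    """Synthetic square-wave frame, used only to exercise the detector."""
    return struct.pack(f"<{n}h", *[amplitude if i % 2 else -amplitude for i in range(n)])


def demo_events() -> List[str]:
    """Two quiet frames, five loud frames, then twelve quiet frames."""
    vad = EnergyVad()
    frames = [tone(0)] * 2 + [tone(2000)] * 5 + [tone(0)] * 12
    events = []
    for frame in frames:
        event = vad.push(frame, agent_is_talking=True)
        if event:
            events.append(event)
    return events


print(demo_events())  # ['barge_in', 'speech_end']
```

The hysteresis (several consecutive frames before either transition) keeps short clicks and brief pauses from toggling the state. Production systems generally use trained VAD models rather than a fixed energy threshold, so treat the numbers here as placeholders.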

Audio is transmitted as RTP over UDP rather than TCP to ensure predictable low latency and avoid head-of-line blocking. Stream's SFU handles bandwidth estimation, simulcast, and NAT traversal.
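
As a rough illustration of what "packetized as RTP" involves, the sketch below builds and parses the 12-byte fixed RTP header defined in RFC 3550. The function names are hypothetical, and payload type 111 is only an assumed dynamic value; in a real WebRTC session the payload type is negotiated in SDP. Header extensions and padding are ignored to keep the example short.

```python
import struct

RTP_VERSION = 2
# Fixed RTP header: two flag bytes, 16-bit sequence number,
# 32-bit timestamp, 32-bit SSRC, all in network byte order.
HEADER = struct.Struct("!BBHII")


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = 111, marker: bool = False) -> bytes:
    """Prefix an encoded audio payload with a minimal RTP header."""
    byte0 = RTP_VERSION << 6                      # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = HEADER.pack(byte0, byte1,
                         seq & 0xFFFF,            # sequence number wraps at 16 bits
                         timestamp & 0xFFFFFFFF,  # media clock ticks
                         ssrc & 0xFFFFFFFF)       # stream identifier
    return header + payload


def parse_rtp_packet(packet: bytes) -> dict:
    """Decode the fixed header; assumes no header extension or padding."""
    byte0, byte1, seq, timestamp, ssrc = HEADER.unpack_from(packet)
    csrc_count = byte0 & 0x0F
    offset = HEADER.size + 4 * csrc_count         # skip any CSRC entries
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[offset:],
    }


packet = build_rtp_packet(b"abc", seq=65541, timestamp=960, ssrc=0x1234)
print(len(packet), parse_rtp_packet(packet)["seq"])  # 15 5
```

The sequence number and timestamp are what let a receiver reorder or skip packets on its own. Because UDP does not retransmit, a lost packet costs one frame of audio instead of stalling everything queued behind it, which is the head-of-line blocking that TCP would introduce.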

Capabilities and integration

Vision Agents provides a plugin-based architecture with 25+ integrations and supports function calling for API-driven actions. The framework includes automatic reconnection logic, session management, and graceful degradation for production deployment. Nova 2 Sonic handles the full speech-to-speech pipeline within a single model, eliminating the need for separate speech-to-text and text-to-speech services.
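
The function-calling idea can be sketched generically: the model emits a structured request naming a function and its arguments, and the application looks the function up and returns a result. The example below is not the Vision Agents or Nova 2 Sonic API. The `tool` decorator, the `TOOLS` registry, the `get_order_status` function, and the JSON message shape are all hypothetical and exist only to show the dispatch pattern.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so a model-issued call can reach it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_order_status(order_id: str) -> dict:
    """Stand-in for a real API call; returns canned data."""
    return {"order_id": order_id, "status": "shipped"}


def dispatch(tool_call_json: str) -> str:
    """Run a tool call of the assumed form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    fn = TOOLS.get(name)
    if fn is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = fn(**call.get("arguments", {}))
    except TypeError as exc:
        # Wrong or missing arguments are reported back instead of crashing the session.
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})


print(dispatch('{"name": "get_order_status", "arguments": {"order_id": "A17"}}'))
```

Returning errors as data, as this sketch does, lets the model recover in conversation (for example by asking the caller to repeat an order number) instead of dropping the call.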

The system supports multilingual voice interactions and maintains full conversational context during barge-in scenarios. Developers can use Stream's global edge network or integrate their preferred real-time communication provider through a decorator-based interface.

Deployment model

The architecture maintains clear account boundaries: customer AWS accounts handle business logic, orchestration, and Bedrock integration, while Stream's AWS infrastructure manages the WebRTC/SFU media plane, TURN/STUN services, and signaling. The Vision Agents runtime runs as worker processes that join sessions as robot peers, terminate WebRTC on the agent side, and bridge the media to the customer's Bedrock integration.

What this means

This release addresses the infrastructure complexity of building production voice agents by providing an open-source framework that handles WebRTC management, audio streaming, and session lifecycle. The sub-500ms end-to-end latency target is meant to keep conversational AI interactions feeling natural, while the separation between Stream's transport layer and customer-controlled Nova 2 Sonic deployments helps address data sovereignty concerns. The framework's 25+ integrations and multi-platform SDKs lower the barrier to deploying voice agents across web, mobile, and desktop applications.
