AWS launches real-time voice agent framework combining Stream Vision Agents with Nova 2 Sonic
Amazon has released Stream's Vision Agents, an open-source Python framework for building real-time voice AI agents that integrates with Amazon Nova 2 Sonic through Amazon Bedrock. The system targets end-to-end latency under 500 milliseconds and relies on Stream's global edge network, which offers sub-30ms audio latency and join times that are typically under 500ms.
Technical architecture
The framework combines three components: Stream's Vision Agents handles orchestration and provides client SDKs for React, iOS, Android, Flutter, and React Native; Amazon Nova 2 Sonic provides speech-to-speech processing through Bedrock's real-time API; and Stream's Edge Network delivers the real-time transport layer with typically sub-500ms join times and under 30ms audio latency.
The system separates concerns by keeping Amazon Nova 2 Sonic running in the customer's AWS account while Stream's infrastructure handles real-time media transport. Audio flows through Stream's globally distributed SFU (Selective Forwarding Unit) nodes, which terminate WebRTC connections and forward audio tracks to Vision Agent worker processes.
Audio pipeline details
Incoming speech is decoded to raw PCM by Vision Agent workers and streamed to Nova 2 Sonic via Bedrock's real-time API. Response audio frames are re-encoded, packetized as RTP, and delivered back through the SFU to client devices. The framework uses voice activity detection (VAD) in the worker to detect speech boundaries and barge-in events, while browser-based echo cancellation prevents the agent's output from retriggering the VAD loop.
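To make the speech-boundary and barge-in logic concrete, here is a minimal sketch of an energy-threshold VAD operating on raw PCM frames. It is not the framework's implementation, which the article does not describe. It assumes 16-bit little-endian mono PCM, and the names `EnergyVad` and `frame_rms`, the threshold, and the frame counts are all hypothetical.

```python
import math
import struct
from typing import Optional


def frame_rms(pcm: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class EnergyVad:
    """Hypothetical energy-threshold VAD; all defaults are illustrative."""

    def __init__(self, threshold: float = 500.0,
                 start_frames: int = 3, end_frames: int = 10) -> None:
        self.threshold = threshold        # RMS level treated as speech
        self.start_frames = start_frames  # consecutive loud frames to open a turn
        self.end_frames = end_frames      # consecutive quiet frames to close a turn
        self.speaking = False
        self._loud = 0
        self._quiet = 0

    def process(self, pcm: bytes, agent_is_talking: bool = False) -> Optional[str]:
        """Return 'speech_start', 'barge_in', 'speech_end', or None for one frame."""
        if frame_rms(pcm) >= self.threshold:
            self._loud += 1
            self._quiet = 0
        else:
            self._quiet += 1
            self._loud = 0

        if not self.speaking and self._loud >= self.start_frames:
            self.speaking = True
            # Speech that begins while the agent is still talking is a barge-in.
            return "barge_in" if agent_is_talking else "speech_start"
        if self.speaking and self._quiet >= self.end_frames:
            self.speaking = False
            return "speech_end"
        return None
```

A worker built along these lines would stop playback of the agent's response when `process` returns `"barge_in"`. Requiring several consecutive loud frames before opening a turn is one simple way to avoid reacting to clicks or brief noise; without echo cancellation on the client, the agent's own audio could cross the threshold and trigger a false barge-in.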
Audio is transmitted as RTP over UDP rather than TCP to keep latency predictable and avoid TCP's head-of-line blocking, where one lost packet stalls everything queued behind it. Stream's SFU handles bandwidth estimation, simulcast, and NAT traversal.
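For readers unfamiliar with RTP framing, the sketch below builds and parses the 12-byte fixed RTP header defined in RFC 3550 using only the standard library. It is an illustration, not code from Vision Agents or Stream's SFU. The function names are hypothetical, and the default payload type of 111 is an assumption (a dynamic value that WebRTC stacks commonly negotiate for Opus); the article does not name the codec.

```python
import struct

RTP_VERSION = 2
RTP_HEADER = "!BBHII"  # network byte order: flags, M+PT, sequence, timestamp, SSRC


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = 111, marker: bool = False) -> bytes:
    """Prefix payload with a fixed RTP header (no padding, extension, or CSRCs)."""
    byte0 = RTP_VERSION << 6  # version=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack(RTP_HEADER, byte0, byte1,
                         seq & 0xFFFF,            # 16-bit sequence number wraps
                         timestamp & 0xFFFFFFFF,  # 32-bit media timestamp wraps
                         ssrc & 0xFFFFFFFF)
    return header + payload


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed header fields from the first 12 bytes of a packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack(RTP_HEADER, packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }
```

The sequence number lets a receiver detect loss and reordering, and the timestamp drives playout timing. Because each packet is self-describing, a receiver can conceal or skip a missing packet and carry on, which is the property that makes UDP a good fit for live audio.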
Capabilities and integration
Vision Agents provides a plugin-based architecture with 25+ integrations and supports function calling for API-driven actions. The framework includes automatic reconnection logic, session management, and graceful degradation for production deployment. Nova 2 Sonic handles the full speech-to-speech pipeline within a single model, eliminating the need for separate speech-to-text and text-to-speech services.
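Function calling generally works by exposing named functions to the model and dispatching the calls it emits. The sketch below shows that pattern in plain Python. It does not use the Vision Agents or Bedrock APIs, whose signatures the article does not give; the `tool` decorator, `dispatch`, the JSON call shape, and `get_order_status` are all hypothetical.

```python
import inspect
import json
from typing import Any, Callable, Dict, List

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register fn under its own name so a model-issued call can reach it."""
    TOOLS[fn.__name__] = fn
    return fn


def describe_tools() -> List[dict]:
    """List each tool's name, docstring, and parameter names for the model."""
    return [
        {
            "name": name,
            "description": inspect.getdoc(fn) or "",
            "parameters": list(inspect.signature(fn).parameters),
        }
        for name, fn in TOOLS.items()
    ]


def dispatch(call_json: str) -> str:
    """Run a call shaped like {"name": ..., "arguments": {...}} (assumed format)."""
    call = json.loads(call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    try:
        return json.dumps({"result": fn(**call.get("arguments", {}))})
    except TypeError as exc:
        # Raised for missing or unexpected arguments; a real system would
        # validate against a schema instead of relying on this.
        return json.dumps({"error": str(exc)})


@tool
def get_order_status(order_id: str) -> str:
    """Look up an order's delivery status (stubbed for illustration)."""
    return f"Order {order_id} is out for delivery"
```

In a voice agent, the result string returned by `dispatch` would be handed back to the model so it can speak the answer. Returning errors as data rather than raising keeps the conversation going when the model requests a tool that does not exist.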
The system supports multilingual voice interactions and maintains full conversational context during barge-in scenarios. Developers can use Stream's global edge network or integrate their preferred real-time communication provider through a decorator-based interface.
Deployment model
The architecture maintains clear account boundaries: customer AWS accounts handle business logic, orchestration, and Bedrock integration, while Stream's AWS infrastructure manages the WebRTC/SFU media plane, TURN/STUN services, and signaling. The Vision Agent runtime runs as worker processes that join calls as WebRTC robot peers and bridge audio to the customer's Bedrock integration.
What this means
This release addresses the infrastructure complexity of building production voice agents by providing an open-source framework that handles WebRTC management, audio streaming, and session lifecycle. The sub-500ms end-to-end latency target makes conversational AI interactions feel natural, while the separation between Stream's transport layer and customer-controlled Nova 2 Sonic deployments addresses data sovereignty concerns. The framework's 25+ integrations and multi-platform SDKs lower the barrier to deploying voice agents across web, mobile, and desktop applications.