Amazon Nova 2 Sonic Unifies Speech Recognition, Reasoning, and TTS in Single Streaming Model
Amazon Web Services released technical guidance for migrating text agents to voice assistants using Amazon Nova 2 Sonic, a native speech-to-speech model that combines automatic speech recognition, reasoning, tool calling, and text-to-speech in a single bidirectional streaming interface. The model supports asynchronous tool calling and built-in voice activity detection for handling interruptions.
Unlike traditional voice agent architectures that chain separate ASR → LLM → TTS components, Nova 2 Sonic handles the entire voice pipeline in a single model. The architecture accepts both text and audio inputs through the same interface, allowing teams to reuse existing prompts and tools from text agents while eliminating the need for a separate text reasoning model in the voice stack.
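Because one stream carries both modalities, a migrated text prompt and live microphone audio can travel over the same connection. The sketch below illustrates that idea with hypothetical event envelopes; the event names and field layout are assumptions for illustration, not the documented Nova 2 Sonic wire format.

```python
import base64
import json

def text_event(role: str, text: str) -> str:
    """Wrap a text turn (e.g. a system prompt reused from a text agent)."""
    return json.dumps({"type": "textInput", "role": role, "text": text})

def audio_event(pcm_chunk: bytes) -> str:
    """Wrap a raw PCM audio chunk as a base64-encoded stream event."""
    return json.dumps({
        "type": "audioInput",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

# The same bidirectional stream accepts both: a prompt migrated unchanged
# from the text agent, followed by chunks of microphone audio.
events = [
    text_event("system", "You are a concise banking assistant."),
    audio_event(b"\x00\x01" * 160),  # 10 ms of 16 kHz, 16-bit audio
]
```

The point of the single interface is that the text pieces (prompts, tool results) need no translation layer before entering the voice stack.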
Architecture and capabilities
According to AWS, Nova 2 Sonic includes built-in voice activity detection (VAD) and turn detection, managing conversation context internally without requiring full history to be sent on each turn. The model supports asynchronous tool calling, enabling conversations to continue naturally while tools run in the background. It can run multiple tools in parallel and adapts if users change requests mid-process.
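Asynchronous tool calling means the conversation does not block on a slow backend. A minimal sketch of the pattern, using a simulated slow tool and filler turns (the tool, turns, and wording are illustrative assumptions, not Nova 2 Sonic APIs):

```python
import asyncio

async def check_balance(account: str) -> dict:
    """Stand-in for a slow backend call a tool might make."""
    await asyncio.sleep(0.05)
    return {"account": account, "balance": 1204.50}

async def voice_turns():
    # The model can keep speaking naturally while the tool runs.
    for line in ["Sure, one moment.", "Anything else in the meantime?"]:
        yield line
        await asyncio.sleep(0.02)

async def session():
    transcript = []
    # Launch the tool as a background task instead of blocking the turn.
    tool_task = asyncio.create_task(check_balance("checking"))
    async for line in voice_turns():      # conversation continues
        transcript.append(line)
    result = await tool_task              # inject the result once ready
    transcript.append(f"Your balance is ${result['balance']:.2f}.")
    return transcript

transcript = asyncio.run(session())
```

Running multiple tools in parallel follows the same shape: several `create_task` calls gathered later, so no single call serializes the dialog.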
AWS identifies latency as a critical difference between text and voice agents. Text agents tolerate latencies of a few seconds, masked by loading indicators. Voice agents must respond within hundreds of milliseconds: each tool call adds audible silence, and even a few seconds of it makes the agent feel unresponsive.
Implementation requirements
The migration requires changes across three architectural components. Client applications need persistent bidirectional connections (WebSocket or WebRTC) and must handle audio encoding/decoding, client events, barge-in logic, and noise control—significantly more complex than stateless REST interfaces used by text clients.
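Barge-in handling is a representative example of the client-side state a voice client must manage. A minimal sketch, assuming a hypothetical event vocabulary in which a VAD event signals that the user started speaking over the agent:

```python
from collections import deque

class PlaybackBuffer:
    """Queues synthesized audio chunks and supports an immediate flush."""
    def __init__(self):
        self.chunks = deque()

    def enqueue(self, chunk: bytes):
        self.chunks.append(chunk)

    def flush(self):
        self.chunks.clear()

def handle_event(event: dict, buffer: PlaybackBuffer):
    if event["type"] == "audioOutput":
        buffer.enqueue(event["audio"])         # queue TTS audio for playback
    elif event["type"] == "userSpeechStarted": # VAD fired: user barged in
        buffer.flush()                         # stop talking immediately

buf = PlaybackBuffer()
handle_event({"type": "audioOutput", "audio": b"\x01"}, buf)
handle_event({"type": "audioOutput", "audio": b"\x02"}, buf)
handle_event({"type": "userSpeechStarted"}, buf)  # user interrupts
```

A stateless REST text client has no equivalent concern; this kind of playback state machine is new work the migration introduces.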
Voice-agent orchestrators must coordinate audio streaming, VAD, ASR, reasoning, and TTS on top of the system prompt management and tool routing that text agents already handle. Nova 2 Sonic's unified interface allows teams to migrate reasoning prompts and tool triggers directly from existing text agents.
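Direct reuse can be as simple as attaching the same tool specs to either session type. The sketch below uses a generic JSON-Schema-style tool definition; the exact schema Nova 2 Sonic expects may differ, and `register_tools` is a hypothetical helper:

```python
# A tool definition migrated unchanged from a text agent.
get_balance_tool = {
    "name": "get_balance",
    "description": "Return the current balance for an account.",
    "inputSchema": {
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
}

def register_tools(session_config: dict, tools: list) -> dict:
    """Attach the same tool specs to a text or a voice session config."""
    return {**session_config, "tools": tools}

text_session = register_tools({"modality": "text"}, [get_balance_tool])
voice_session = register_tools({"modality": "speech"}, [get_balance_tool])
```

Because the model accepts the text agent's prompts and tools as-is, the orchestrator change is mostly additive (audio plumbing) rather than a rewrite of the reasoning layer.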
Response design also shifts fundamentally. Text agents deliver paragraphs with rich formatting, lists, and links that users can read at their own pace. Voice agents require conversational, concise responses structured for listening. For example, a banking text agent might display full account summaries with formatted lists, while a voice agent would break information into digestible chunks and ask for confirmation before continuing.
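The chunk-and-confirm pattern from the banking example can be sketched as a small response-shaping helper. This is an illustrative assumption about how a team might implement it, not an AWS-provided API:

```python
def to_voice_chunks(items: list[str], confirm: str = "Want to hear more?"):
    """Turn a list-style text answer into short utterances for listening,
    pausing for user confirmation between items."""
    chunks = []
    for i, item in enumerate(items):
        chunks.append(item)
        if i < len(items) - 1:
            chunks.append(confirm)  # give the user a chance to stop or redirect
    return chunks

chunks = to_voice_chunks([
    "Your checking balance is $1,204.50.",
    "Your savings balance is $8,900.12.",
])
```

The same account data that a text agent would render as one formatted list becomes a short spoken exchange with natural stopping points.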
Availability
AWS published a sample repository with a skill that works with AI IDEs like Kiro and Claude Code to automatically convert text agents into voice agents. Pricing for Nova 2 Sonic was not disclosed in the announcement.
What this means
Nova 2 Sonic represents AWS's push into native speech-to-speech models that compete with OpenAI's Realtime API and similar offerings. By unifying the voice pipeline in a single model rather than chaining components, AWS claims to reduce latency and architectural complexity. The asynchronous tool calling and built-in conversation management address key pain points in voice agent development, though real-world performance metrics and benchmark scores have not been published. The lack of disclosed pricing makes cost comparison with existing voice agent architectures difficult.
Related Articles
OpenAI releases ChatGPT Images 2.0 with accurate text rendering and brand-style matching
OpenAI launched ChatGPT Images 2.0, upgrading from decorative images to full-page graphics with detailed text rendering. The update is available to all ChatGPT tiers, with advanced features requiring paid subscriptions that access the Thinking model. Hands-on testing shows significant improvements in text accuracy and brand-style replication, though factual errors still occur.
IBM releases Bob AI coding assistant after testing on 80,000 employees, claims 45% productivity gains
IBM has launched Bob, its AI coding assistant, following internal testing with 80,000 employees. The company claims teams saw average productivity gains of 45% across complex workflows. Pricing ranges from $20 to $200 per month using a "Bobcoin" credit system.
Amazon launches Quick desktop app with persistent context tracking across Google Workspace, Microsoft 365, Zoom, and Salesforce
Amazon has released a desktop version of its Quick AI assistant that integrates with Google Workspace, Microsoft 365, Zoom, and Salesforce, storing persistent context about user activities to automate tasks. The company also split Amazon Connect into four vertical-specific products: Connect Decisions, Connect Talent, Connect Health, and Connect Customer AI.
Google cuts Gemini voice assistant response time by 1.5 seconds for smart home controls
Google's Gemini for Home voice assistant now executes smart home commands up to 1.5 seconds faster for lights and plugs, the company announced. The update also brings near-instant processing for alarms, timers, and reminders, currently available for English, French, and Spanish users.