Product update | Amazon Web Services

Amazon Nova 2 Sonic Unifies Speech Recognition, Reasoning, and TTS in Single Streaming Model

TL;DR

Amazon Web Services released technical guidance for migrating text agents to voice assistants using Amazon Nova 2 Sonic, a native speech-to-speech model that combines automatic speech recognition, reasoning, tool calling, and text-to-speech in a single bidirectional streaming interface. The model supports asynchronous tool calling and built-in voice activity detection for handling interruptions.

Amazon Web Services released technical guidance for migrating text agents to voice assistants using Amazon Nova 2 Sonic, a native speech-to-speech model that unifies automatic speech recognition (ASR), reasoning, tool use, and text-to-speech (TTS) in one bidirectional streaming interface.

Unlike traditional voice agent architectures that chain separate ASR → LLM → TTS components, Nova 2 Sonic handles the entire voice pipeline in a single model. The architecture accepts both text and audio inputs through the same interface, allowing teams to reuse existing prompts and tools from text agents while eliminating the need for a separate text reasoning model in the voice stack.

Architecture and capabilities

According to AWS, Nova 2 Sonic includes built-in voice activity detection (VAD) and turn detection, managing conversation context internally without requiring full history to be sent on each turn. The model supports asynchronous tool calling, enabling conversations to continue naturally while tools run in the background. It can run multiple tools in parallel and adapts if users change requests mid-process.
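
The asynchronous, parallel tool-calling pattern described above can be illustrated with plain Python `asyncio`. This is a minimal sketch of the general idea, not the Nova 2 Sonic API: the tool names (`get_balance`, `get_recent_transactions`), the delays, and `handle_turn` are all hypothetical stand-ins.

```python
import asyncio

# Hypothetical tools; names and delays are illustrative, not part of any AWS API.
async def get_balance(account: str) -> str:
    await asyncio.sleep(0.05)  # stands in for a slow backend call
    return f"balance for {account}: 120.50"

async def get_recent_transactions(account: str) -> str:
    await asyncio.sleep(0.08)
    return f"3 recent transactions for {account}"

async def handle_turn(account: str) -> list:
    """Start tools in the background, keep talking, then report the results."""
    spoken = []
    # Launch both tools at once instead of awaiting them one after another.
    tasks = [
        asyncio.create_task(get_balance(account)),
        asyncio.create_task(get_recent_transactions(account)),
    ]
    # The conversation is not blocked while the tools run.
    spoken.append("Sure, let me look that up.")
    # gather() returns results in the order the tasks were passed in.
    for result in await asyncio.gather(*tasks):
        spoken.append(result)
    return spoken

if __name__ == "__main__":
    for line in asyncio.run(handle_turn("checking")):
        print(line)
```

In this sketch the acknowledgement is produced before either tool finishes, and the total wait is governed by the slower tool rather than the sum of both.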

AWS identifies latency as a critical difference between text and voice agents. Text agents can tolerate delays of a few seconds, typically masked by loading indicators. Voice agents need to respond within hundreds of milliseconds, and a delay of even a few seconds during a tool call feels unresponsive because each tool call adds noticeable silence to the conversation.
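
A rough way to reason about this is to add up the silence a user would hear if the agent waited for every tool before speaking. The sketch below is illustrative arithmetic only: the 500 ms budget and the tool latencies are assumed numbers, not figures from AWS.

```python
def silence_ms(tool_latencies_ms, parallel):
    """Silence heard if the agent waits for every tool before speaking."""
    if not tool_latencies_ms:
        return 0
    # Parallel calls cost as much as the slowest tool; sequential calls add up.
    return max(tool_latencies_ms) if parallel else sum(tool_latencies_ms)

# Assumed threshold for illustration; the article only says "hundreds of milliseconds".
VOICE_BUDGET_MS = 500

latencies = [800, 1200]  # two hypothetical tool calls
print(silence_ms(latencies, parallel=False))  # 2000
print(silence_ms(latencies, parallel=True))   # 1200
print(silence_ms(latencies, parallel=True) <= VOICE_BUDGET_MS)  # False
```

Under these assumed numbers, even parallel execution overshoots the budget, which is why continuing to speak while tools run matters for voice.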

Implementation requirements

The migration requires changes across three architectural components. Client applications need persistent bidirectional connections (WebSocket or WebRTC) and must handle audio encoding and decoding, client events, barge-in logic, and noise control. This is significantly more complex than the stateless REST interfaces used by text clients.
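
Barge-in logic on the client usually comes down to discarding audio that has been received but not yet played once the user starts talking. The following is a minimal sketch under that assumption; `PlaybackBuffer` and its method names are hypothetical and do not correspond to any AWS SDK class or event.

```python
from collections import deque
from typing import Optional

class PlaybackBuffer:
    """Client-side queue of assistant audio chunks awaiting playback."""

    def __init__(self) -> None:
        self._chunks = deque()

    def enqueue(self, chunk: bytes) -> None:
        # Called whenever an audio chunk arrives over the streaming connection.
        self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        # Called by the audio output loop; returns None when nothing is queued.
        return self._chunks.popleft() if self._chunks else None

    def on_user_speech_started(self) -> int:
        """Barge-in: drop queued audio so the assistant stops talking at once."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped

buf = PlaybackBuffer()
for chunk in (b"a", b"b", b"c"):
    buf.enqueue(chunk)
print(buf.next_chunk())              # b'a'
print(buf.on_user_speech_started())  # 2
print(buf.next_chunk())              # None
```

A real client would also need to stop audio already handed to the output device; this sketch covers only the queued portion.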

In a text agent, the orchestrator handles system prompt management and tool routing. In a voice agent, it must also cover audio streaming, VAD, ASR, reasoning, and TTS. Nova 2 Sonic's unified interface allows teams to migrate reasoning prompts and tool triggers directly from existing text agents.

Response design also shifts fundamentally. Text agents deliver paragraphs with rich formatting, lists, and links that users can read at their own pace. Voice agents require conversational, concise responses structured for listening. For example, a banking text agent might display full account summaries with formatted lists, while a voice agent would break information into digestible chunks and ask for confirmation before continuing.
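
The banking example can be sketched as a small helper that splits a long summary into short spoken turns and asks before continuing. This is an illustrative sketch only: `voice_chunks`, the two-items-per-turn default, and the confirmation wording are assumptions, not part of the AWS guidance.

```python
def voice_chunks(items, per_turn=2):
    """Group items into short spoken turns, asking before continuing."""
    turns = []
    for start in range(0, len(items), per_turn):
        batch = items[start:start + per_turn]
        text = ". ".join(batch) + "."
        # Only ask for confirmation when more items remain after this turn.
        if start + per_turn < len(items):
            text += " Would you like to hear more?"
        turns.append(text)
    return turns

summary = [
    "Checking has 1,200 dollars",
    "Savings has 5,000 dollars",
    "Your credit card balance is 300 dollars",
]
for turn in voice_chunks(summary):
    print(turn)
```

A text agent could render the same three items as one formatted list, while the voice version delivers two items, pauses for confirmation, and then reads the last one.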

Availability

AWS published a sample repository with a skill that works with AI IDEs like Kiro and Claude Code to automatically convert text agents into voice agents. Pricing for Nova 2 Sonic was not disclosed in the announcement.

What this means

Nova 2 Sonic represents AWS's push into native speech-to-speech models that compete with OpenAI's Realtime API and similar offerings. By unifying the voice pipeline in a single model rather than chaining components, AWS claims to reduce latency and architectural complexity. The asynchronous tool calling and built-in conversation management address key pain points in voice agent development, though real-world performance metrics and benchmark scores have not been published. The lack of disclosed pricing makes cost comparison with existing voice agent architectures difficult.
