product update | Amazon Web Services

AWS releases open-source test harness for evaluating Amazon Nova Sonic voice agents at scale

TL;DR

Amazon has released an open-source testing framework for Nova Sonic voice agents that automates multi-turn conversation evaluation without requiring human testers. The harness uses LLM-as-judge techniques to assess voice agents across six metrics including goal achievement, response accuracy, and tool usage, addressing a critical QA bottleneck in voice AI development.

Amazon has released an open-source testing framework that automates quality assurance for Nova Sonic voice agents, eliminating the need for manual conversation testing. The Nova Sonic Test Harness addresses a critical bottleneck in voice AI development: the inability to systematically test bidirectional audio streaming applications.

Testing Voice Agents Without Human Testers

The framework runs complete multi-turn conversations with Amazon Nova Sonic automatically, using an LLM-powered user simulator. According to AWS, teams previously had to test voice agents manually by having someone talk to the system. That process doesn't scale: evaluating 50 conversation scenarios across 3 user personas requires 150 manual tests of several minutes each.
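
To make that arithmetic concrete, the following Python sketch enumerates the scenario-persona matrix. The scenario and persona names are placeholders, and the four-minute duration is an assumed stand-in for "several minutes"; none of these values come from AWS.

```python
from itertools import product

# Placeholder names: the real harness defines scenarios and personas in its own config files.
scenarios = [f"scenario_{i:02d}" for i in range(1, 51)]  # 50 conversation scenarios
personas = ["impatient", "confused", "cooperative"]      # 3 user personas

# Every scenario is run once per persona.
test_matrix = list(product(scenarios, personas))

MINUTES_PER_TEST = 4  # assumption standing in for "several minutes each"
total_hours = len(test_matrix) * MINUTES_PER_TEST / 60

print(f"{len(test_matrix)} tests, about {total_hours:.1f} hours")
# prints: 150 tests, about 10.0 hours
```

Under that assumption, a single manual pass over the full matrix would take more than a working day.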

The test harness uses LLM-as-judge evaluation techniques to assess conversations across six built-in metrics organized into three tiers:

Critical tier:

  • Goal Achievement (Did the conversation accomplish the user's objective?)
  • Response Accuracy (Were facts, numbers, and claims correct?)

Important tier:

  • Tool Usage (Were the right tools called with correct parameters?)
  • Conversation Flow (Natural turn-taking, appropriate responses)

The remaining two metrics cover audio-text consistency and general quality.
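
One way a tiered metric set could be folded into a single verdict is sketched below. The article does not say how the harness combines tiers, so the pass/warn/fail rule here is an assumption, as are the snake_case metric keys and the "additional" tier label.

```python
# Metric-to-tier mapping based on the article's list; key names are illustrative.
TIERS = {
    "goal_achievement": "critical",
    "response_accuracy": "critical",
    "tool_usage": "important",
    "conversation_flow": "important",
    "audio_text_consistency": "additional",
    "general_quality": "additional",
}


def overall_verdict(results: dict[str, bool]) -> str:
    """Combine per-metric pass/fail results into one verdict.

    Assumed rule (not documented by AWS): any failed critical metric fails
    the test, a failed important metric downgrades it to a warning, and
    failures in the additional tier do not change the verdict.
    """
    failed_tiers = {TIERS[name] for name, passed in results.items() if not passed}
    if "critical" in failed_tiers:
        return "fail"
    if "important" in failed_tiers:
        return "warn"
    return "pass"


if __name__ == "__main__":
    print(overall_verdict({"goal_achievement": True, "tool_usage": False}))
```

Separating tiers this way lets a regression run block on the metrics that matter most while still surfacing softer problems.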

Technical Architecture

The harness handles four challenges unique to speech-to-speech model testing:

  1. Bidirectional streaming: Manages persistent, full-duplex connections where audio and text flow simultaneously, unlike standard HTTP request-response patterns
  2. Non-deterministic responses: Evaluates against rubrics rather than exact string matching, since the same question produces different wording each time
  3. Session management: Automatically handles Nova Sonic's 8-minute connection timeout by creating new sessions and replaying conversation history
  4. Audio hallucination detection: Identifies cases where audio output diverges from text output (e.g., audio says "3:30 PM" while text reads "3:00 PM")
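
The session-management item can be illustrated with a small Python sketch that renews a session shortly before the 8-minute limit and hands the accumulated history to the new one. The class name, the `open_session` callback, and the 30-second safety margin are all hypothetical; the harness's actual implementation is not shown in the article.

```python
import time
from typing import Callable, List, Tuple

SESSION_LIMIT_S = 8 * 60  # Nova Sonic connection timeout cited in the article
SAFETY_MARGIN_S = 30      # assumed margin: renew slightly before the limit

Turn = Tuple[str, str]    # (role, text)


class RollingSession:
    """Tracks session age and replays history into a fresh session when needed."""

    def __init__(
        self,
        open_session: Callable[[List[Turn]], object],  # hypothetical factory
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._open_session = open_session
        self._clock = clock
        self.history: List[Turn] = []
        self.sessions_opened = 0
        self._renew()

    def _renew(self) -> None:
        # A new session receives a copy of everything said so far.
        self.session = self._open_session(list(self.history))
        self._started = self._clock()
        self.sessions_opened += 1

    def record_turn(self, role: str, text: str) -> None:
        self.history.append((role, text))

    def ensure_fresh(self) -> None:
        """Call before each turn; renews if the session is close to expiring."""
        age = self._clock() - self._started
        if age >= SESSION_LIMIT_S - SAFETY_MARGIN_S:
            self._renew()
```

Injecting the clock keeps the rollover logic testable without waiting eight minutes.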

Test scenarios are defined in JSON configuration files that specify system prompts, voice IDs, available tools, user personas, and evaluation criteria. The framework supports both text input (faster) and synthesized audio via Amazon Polly for full speech recognition pipeline testing.
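
A scenario file of that kind might look roughly like the JSON embedded in the Python sketch below. Every key name, the `input_mode` values, the tool definition, and the voice ID are illustrative assumptions; the harness's real schema may differ.

```python
import json

# Hypothetical scenario file contents; all key names are illustrative.
SCENARIO_JSON = """
{
  "system_prompt": "You are a scheduling assistant for a dental clinic.",
  "voice_id": "matthew",
  "input_mode": "text",
  "tools": [
    {"name": "book_appointment", "parameters": {"date": "string", "time": "string"}}
  ],
  "persona": "A busy parent who wants the earliest slot on Friday.",
  "evaluation_criteria": [
    "The agent books an appointment on Friday.",
    "The agent confirms the time back to the user."
  ]
}
"""

# The five elements the article says a scenario specifies.
REQUIRED_KEYS = {"system_prompt", "voice_id", "tools", "persona", "evaluation_criteria"}


def load_scenario(raw: str) -> dict:
    """Parse a scenario and reject it early if a required key is missing."""
    scenario = json.loads(raw)
    missing = REQUIRED_KEYS - scenario.keys()
    if missing:
        raise ValueError(f"scenario is missing keys: {sorted(missing)}")
    return scenario


if __name__ == "__main__":
    scenario = load_scenario(SCENARIO_JSON)
    print(scenario["voice_id"], len(scenario["tools"]))
```

Validating the file before a run starts avoids spending a full conversation on a misconfigured scenario.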

Evaluation Pipeline

Each test follows a four-phase pipeline:

  1. Configuration: Define the scenario in JSON, including Nova Sonic's role, the user persona, the tools, and the success criteria
  2. Conversation execution: A user simulator (powered by a model such as Claude Haiku on Bedrock) generates messages, Nova Sonic responds, and tool calls execute in-stream
  3. Turn completion detection: Uses Nova Sonic's two-stage text production (speculative, then final) to determine when a turn ends, which is more reliable than silence detection
  4. LLM judge evaluation: A separate LLM (e.g., Claude Opus) assesses the full transcript against the criteria without knowing the test setup
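
The turn completion step can be sketched in Python as follows. The event dictionaries are a simplified stand-in for Nova Sonic's actual output events, and the sketch assumes that all speculative segments of a turn arrive before the first final one; it is an illustration of the idea, not the harness's code.

```python
from typing import Dict, Iterable, List


def collect_turn(events: Iterable[Dict[str, str]]) -> List[str]:
    """Consume events until every speculative segment has been finalized.

    Each event is a simplified dict such as
    {"stage": "SPECULATIVE", "text": "..."} or {"stage": "FINAL", "text": "..."}.
    Returns the final texts that make up the completed turn.
    """
    pending = 0              # speculative segments still awaiting a final version
    finals: List[str] = []
    for event in events:
        if event["stage"] == "SPECULATIVE":
            pending += 1
        elif event["stage"] == "FINAL":
            pending -= 1
            finals.append(event["text"])
            if pending == 0:
                # Every previewed segment is now final, so the turn is over.
                return finals
    raise RuntimeError("stream ended before the turn was finalized")


if __name__ == "__main__":
    stream = [
        {"stage": "SPECULATIVE", "text": "Your appointment is"},
        {"stage": "SPECULATIVE", "text": "at 3:30 PM."},
        {"stage": "FINAL", "text": "Your appointment is"},
        {"stage": "FINAL", "text": "at 3:30 PM."},
    ]
    print(" ".join(collect_turn(stream)))
```

Counting outstanding segments gives an explicit end-of-turn signal, whereas silence detection has to guess from a pause in the audio.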

All conversation artifacts—text transcripts, audio WAV files, tool calls, and timing metadata—are logged for analysis.

What This Means

The release addresses a fundamental gap in voice AI quality assurance. While text-based LLM testing has established frameworks, speech-to-speech models require different approaches because their output is streamed and non-deterministic. By automating evaluation, the harness enables rapid iteration on system prompts and tool configurations, previously a manual bottleneck, and provides regression testing before production deployment. The open-source release suggests AWS is building developer tooling around Nova Sonic to support enterprise voice agent adoption. The framework's effectiveness, however, will depend on how well LLM judges assess subjective qualities such as conversation naturalness compared with human evaluators.
