AWS releases open-source test harness for evaluating Amazon Nova Sonic voice agents at scale
Amazon has released an open-source testing framework for Nova Sonic voice agents that automates multi-turn conversation evaluation without requiring human testers. The harness uses LLM-as-judge techniques to assess voice agents across six metrics including goal achievement, response accuracy, and tool usage, addressing a critical QA bottleneck in voice AI development.
Amazon has released an open-source testing framework that automates quality assurance for Nova Sonic voice agents, eliminating the need for manual conversation testing. The Nova Sonic Test Harness addresses a critical bottleneck in voice AI development: the inability to systematically test bidirectional audio streaming applications.
Testing Voice Agents Without Human Testers
The framework uses an LLM-powered user simulator to run complete multi-turn conversations with Amazon Nova Sonic automatically. According to AWS, teams previously had to test voice agents manually by having someone talk to the system, a process that does not scale: evaluating 50 conversation scenarios across 3 user personas means 150 manual tests, each taking several minutes.
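The 150-test figure is a cross product of scenarios and personas. The sketch below reproduces that arithmetic; the scenario and persona names are placeholders, and the four-minute duration is an assumption, since the source says only "several minutes each".

```python
from itertools import product

# Hypothetical placeholders: the source gives only the counts (50 and 3).
scenarios = [f"scenario_{i:02d}" for i in range(1, 51)]
personas = ["persona_a", "persona_b", "persona_c"]

# Every scenario is run once per persona.
test_matrix = list(product(scenarios, personas))

# Assumed duration; the source says only "several minutes each".
MINUTES_PER_TEST = 4
total_hours = len(test_matrix) * MINUTES_PER_TEST / 60

print(len(test_matrix))  # 150
print(total_hours)       # 10.0
```

Under that assumed duration, one full manual pass would take about ten hours of someone's time, which is the cost the simulator is meant to remove.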
The test harness uses LLM-as-judge evaluation techniques to assess conversations across six built-in metrics organized into three tiers:
Critical tier:
- Goal Achievement (Did the conversation accomplish the user's objective?)
- Response Accuracy (Were facts, numbers, and claims correct?)
Important tier:
- Tool Usage (Were the right tools called with correct parameters?)
- Conversation Flow (Natural turn-taking, appropriate responses)
The remaining two built-in metrics cover audio-text consistency and general quality.
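The article does not say how the tiers affect a test's overall result. One plausible reading, sketched below purely as an assumption, is that a failed critical metric fails the run while a failed important metric only raises a warning. The `MetricResult` type, the `overall_verdict` function, and the pass/warn/fail labels are all hypothetical and are not taken from the harness.

```python
from dataclasses import dataclass

# Tier membership for the four metrics the article names.
TIERS = {
    "goal_achievement": "critical",
    "response_accuracy": "critical",
    "tool_usage": "important",
    "conversation_flow": "important",
}


@dataclass
class MetricResult:
    metric: str
    passed: bool


def overall_verdict(results: list[MetricResult]) -> str:
    """Assumed policy: a critical failure fails the run, an important failure warns."""
    failed_tiers = {TIERS[r.metric] for r in results if not r.passed}
    if "critical" in failed_tiers:
        return "fail"
    if "important" in failed_tiers:
        return "warn"
    return "pass"


results = [
    MetricResult("goal_achievement", True),
    MetricResult("response_accuracy", True),
    MetricResult("tool_usage", False),
    MetricResult("conversation_flow", True),
]
print(overall_verdict(results))  # warn
```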
Technical Architecture
The harness handles four challenges unique to speech-to-speech model testing:
- Bidirectional streaming: Manages persistent, full-duplex connections where audio and text flow simultaneously, unlike standard HTTP request-response patterns
- Non-deterministic responses: Evaluates against rubrics rather than exact string matching, since the same question produces different wording each time
- Session management: Automatically handles Nova Sonic's 8-minute connection timeout by creating new sessions and replaying conversation history
- Audio hallucination detection: Identifies cases where audio output diverges from text output (e.g., audio says "3:30 PM" while text reads "3:00 PM")
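The session-management point can be illustrated with a small wrapper that opens a fresh session shortly before the 8-minute limit and hands it the conversation history to replay. This is a sketch under stated assumptions, not the harness's implementation: `RollingSession`, the `open_session` factory, and the 30-second safety margin are all hypothetical, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable

SESSION_LIMIT_SECONDS = 8 * 60  # connection timeout stated in the article
SAFETY_MARGIN_SECONDS = 30      # assumed margin so rollover happens before the cutoff


class RollingSession:
    """Hypothetical wrapper that reopens a session and replays history near the limit."""

    def __init__(self, open_session: Callable[[list[dict]], object],
                 clock: Callable[[], float] = time.monotonic):
        self._open_session = open_session  # caller-supplied factory; receives history to replay
        self._clock = clock
        self.history: list[dict] = []
        self.sessions_opened = 0
        self._start()

    def _start(self) -> None:
        # Pass a copy so later turns do not mutate what the session was given.
        self.session = self._open_session(list(self.history))
        self.started_at = self._clock()
        self.sessions_opened += 1

    def record_turn(self, role: str, text: str) -> None:
        self.history.append({"role": role, "text": text})

    def ensure_fresh(self) -> None:
        """Call before each turn; rolls over if the session is near its limit."""
        age = self._clock() - self.started_at
        if age >= SESSION_LIMIT_SECONDS - SAFETY_MARGIN_SECONDS:
            self._start()


# Usage with a fake clock and a factory that just records what it was asked to replay.
now = [0.0]
replayed: list[list[dict]] = []
rs = RollingSession(lambda history: replayed.append(history) or object(),
                    clock=lambda: now[0])
rs.record_turn("user", "Book a table for two.")
now[0] = 200.0
rs.ensure_fresh()  # well within the limit, no rollover
now[0] = 460.0
rs.ensure_fresh()  # past 450 s, so a new session is opened with the history
print(rs.sessions_opened)  # 2
```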
Test scenarios are defined in JSON configuration files that specify system prompts, voice IDs, available tools, user personas, and evaluation criteria. The framework supports both text input (faster) and synthesized audio via Amazon Polly for full speech recognition pipeline testing.
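A scenario file along these lines might look like the following. The key names, the `input_mode` switch, the voice ID, and the tool definition are illustrative assumptions; the article lists the kinds of fields a scenario contains but not their actual names. The loader checks only that the categories named above are present.

```python
import json

# Illustrative scenario; every key name and value here is an assumption.
SCENARIO_JSON = """
{
  "system_prompt": "You are a restaurant booking assistant.",
  "voice_id": "example-voice",
  "tools": [
    {"name": "check_availability",
     "parameters": {"date": "string", "party_size": "integer"}}
  ],
  "user_persona": "A hurried caller who wants a table for two tonight.",
  "evaluation_criteria": ["The agent confirms the date, time, and party size."],
  "input_mode": "text"
}
"""

REQUIRED_KEYS = {"system_prompt", "voice_id", "tools",
                 "user_persona", "evaluation_criteria"}


def load_scenario(raw: str) -> dict:
    """Parse a scenario and reject it if a required field is missing."""
    scenario = json.loads(raw)
    missing = REQUIRED_KEYS - set(scenario)
    if missing:
        raise ValueError(f"scenario is missing keys: {sorted(missing)}")
    # "text" is the faster path; "audio" would route input through speech synthesis.
    if scenario.get("input_mode", "text") not in {"text", "audio"}:
        raise ValueError("input_mode must be 'text' or 'audio'")
    return scenario


scenario = load_scenario(SCENARIO_JSON)
print(scenario["tools"][0]["name"])  # check_availability
```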
Evaluation Pipeline
Each test follows a four-phase pipeline:
- Configuration: Define scenario with JSON including Nova Sonic's role, user persona, tools, and success criteria
- Conversation execution: User simulator (powered by models like Claude Haiku on Bedrock) generates messages, Nova Sonic responds, tool calls execute in-stream
- Turn completion detection: Uses Nova Sonic's two-stage text production (speculative, then final) to determine when a turn has ended, an approach described as more reliable than silence detection
- LLM judge evaluation: A separate LLM (e.g., Claude Opus) assesses the full transcript against the criteria without knowledge of the test setup
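The turn-completion phase can be sketched as a small state tracker. The article does not describe the exact rule the harness applies, so the heuristic here is an assumption: a turn is treated as complete once every speculative text segment has been followed by a final one. The `TextEvent` shape is a simplified, hypothetical stand-in for the real stream events.

```python
from dataclasses import dataclass


@dataclass
class TextEvent:
    """Simplified, hypothetical event shape; the real stream carries more fields."""
    stage: str  # "SPECULATIVE" or "FINAL"
    text: str


class TurnTracker:
    """Assumed heuristic: a turn is done once finals have caught up with speculatives."""

    def __init__(self) -> None:
        self.speculative = 0
        self.final_segments: list[str] = []

    def feed(self, event: TextEvent) -> bool:
        """Record one event and return True if the turn now looks complete."""
        if event.stage == "SPECULATIVE":
            self.speculative += 1
        elif event.stage == "FINAL":
            self.final_segments.append(event.text)
        else:
            raise ValueError(f"unknown stage: {event.stage}")
        return self.speculative > 0 and len(self.final_segments) >= self.speculative


tracker = TurnTracker()
events = [
    TextEvent("SPECULATIVE", "Your table is booked"),
    TextEvent("SPECULATIVE", "for 7 PM."),
    TextEvent("FINAL", "Your table is booked"),
    TextEvent("FINAL", "for 7 PM."),
]
flags = [tracker.feed(e) for e in events]
print(flags)  # [False, False, False, True]
```

A counter like this would signal completion too early if final text for one segment arrived before the next speculative segment began, so a real implementation would likely combine it with an explicit end-of-turn signal from the stream.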
All conversation artifacts—text transcripts, audio WAV files, tool calls, and timing metadata—are logged for analysis.
What This Means
The release addresses a fundamental gap in voice AI quality assurance. Text-based LLM testing has established frameworks, but speech-to-speech models require different approaches because they stream and respond non-deterministically. By automating evaluation, the harness enables rapid iteration on system prompts and tool configurations, which was previously a manual bottleneck, and provides regression testing before production deployment. The open-source release suggests AWS is building developer tooling around Nova Sonic to support enterprise voice agent adoption. The framework's effectiveness will depend on how well LLM judges can assess subjective qualities such as conversation naturalness compared with human evaluators.