AWS releases open-source test harness for evaluating Amazon Nova Sonic voice agents at scale

TL;DR

Amazon has released an open-source testing framework for Nova Sonic voice agents that automates multi-turn conversation evaluation without requiring human testers. The harness uses LLM-as-judge techniques to assess voice agents across six metrics including goal achievement, response accuracy, and tool usage, addressing a critical QA bottleneck in voice AI development.

June 8, 2026 · 4:05 PM3 min read

AWS Releases Open-Source Test Harness for Evaluating Amazon Nova Sonic Voice Agents at Scale

Amazon has released an open-source testing framework that automates quality assurance for Nova Sonic voice agents, eliminating the need for manual conversation testing. The Nova Sonic Test Harness addresses a critical bottleneck in voice AI development: the inability to systematically test bidirectional audio streaming applications.

Testing Voice Agents Without Human Testers

The framework runs complete multi-turn conversations with Amazon Nova Sonic automatically using an LLM-powered user simulator. According to AWS, teams previously needed to manually test voice agents by having someone physically talk to the system—a process that doesn't scale when evaluating 50 conversation scenarios across 3 user personas, requiring 150 manual tests taking several minutes each.

The test harness uses LLM-as-judge evaluation techniques to assess conversations across six built-in metrics organized into three tiers:

Critical tier:

Goal Achievement (Did the conversation accomplish the user's objective?)
Response Accuracy (Were facts, numbers, and claims correct?)

Important tier:

Tool Usage (Were the right tools called with correct parameters?)
Conversation Flow (Natural turn-taking, appropriate responses)

Additional metrics include audio-text consistency and general quality assessments.

Technical Architecture

The harness handles four challenges unique to speech-to-speech model testing:

Bidirectional streaming: Manages persistent, full-duplex connections where audio and text flow simultaneously, unlike standard HTTP request-response patterns
Non-deterministic responses: Evaluates against rubrics rather than exact string matching, since the same question produces different wording each time
Session management: Automatically handles Nova Sonic's 8-minute connection timeout by creating new sessions and replaying conversation history
Audio hallucination detection: Identifies cases where audio output diverges from text output (e.g., audio says "3:30 PM" while text reads "3:00 PM")

Test scenarios are defined in JSON configuration files that specify system prompts, voice IDs, available tools, user personas, and evaluation criteria. The framework supports both text input (faster) and synthesized audio via Amazon Polly for full speech recognition pipeline testing.

Evaluation Pipeline

Each test follows a four-phase pipeline:

Configuration: Define scenario with JSON including Nova Sonic's role, user persona, tools, and success criteria
Conversation execution: User simulator (powered by models like Claude Haiku on Bedrock) generates messages, Nova Sonic responds, tool calls execute in-stream
Turn completion detection: Uses Nova Sonic's two-stage text production (speculative then final) to determine when turns end, more reliably than silence detection
LLM judge evaluation: Separate LLM (e.g., Claude Opus) assesses full transcript against criteria without knowing test setup

All conversation artifacts—text transcripts, audio WAV files, tool calls, and timing metadata—are logged for analysis.

What This Means

The release addresses a fundamental gap in voice AI quality assurance. While text-based LLM testing has established frameworks, speech-to-speech models require different approaches due to their streaming, non-deterministic nature. By automating evaluation, the harness enables rapid iteration on system prompts and tool configurations—previously a manual bottleneck—and provides regression testing capabilities before production deployment. The open-source release suggests AWS is building developer tooling around Nova Sonic to support enterprise voice agent adoption, though the framework's effectiveness will depend on how well LLM judges can assess subjective qualities like conversation naturalness compared to human evaluators.

Source: aws.amazon.com ↗

amazon-nova-sonic voice-agents testing-framework llm-as-judge aws open-source speech-to-speech quality-assurance

product updateJuly 16, 2026

AWS launches Managed Knowledge Base for Bedrock with 6 enterprise connectors and automatic ACL enforcement

Amazon Web Services launched Managed Knowledge Base for Bedrock in general availability, offering a fully managed retrieval solution with six native enterprise connectors including SharePoint, Confluence, and Google Drive. The service handles document parsing up to 500 MB for PDFs, 2 GB for audio, and 10 GB for video, with real-time access control list verification at query time.

product updateJuly 16, 2026

LM Studio launches Bionic, agentic app for local and cloud open-source models

LM Studio released Bionic, a Mac app that runs open-source AI models locally or via cloud for coding, document processing, and research tasks. The app includes offline voice transcription using Mistral's Voxtral model and supports models like GLM 5.2 and Kimi K2.7 Code for codebase editing.

product updateJuly 16, 2026

xAI's Grok 4.3 now available on AWS Bedrock with 1M token context and configurable reasoning

xAI has made Grok 4.3 generally available on Amazon Bedrock, marking xAI's debut as a Bedrock model provider. The multimodal model offers a 1 million token context window, configurable reasoning effort (none/low/medium/high), and runs on Bedrock's Mantle inference engine using OpenAI-compatible APIs.

product updateJuly 16, 2026

AWS launches AgentCore platform for building voice AI agents with Amazon Nova 2 Sonic

AWS has released AgentCore, a new platform for hosting and running voice-based AI agents, integrated with Amazon Nova 2 Sonic for real-time speech capabilities. The platform uses the open Model Context Protocol (MCP) to connect agents to backend systems and deploys each conversation in isolated microVMs.

AWS releases open-source test harness for evaluating Amazon Nova Sonic voice agents at scale

AWS Releases Open-Source Test Harness for Evaluating Amazon Nova Sonic Voice Agents at Scale

Testing Voice Agents Without Human Testers

Technical Architecture

Evaluation Pipeline

What This Means

Related Articles

AWS launches Managed Knowledge Base for Bedrock with 6 enterprise connectors and automatic ACL enforcement

LM Studio launches Bionic, agentic app for local and cloud open-source models

xAI's Grok 4.3 now available on AWS Bedrock with 1M token context and configurable reasoning

AWS launches AgentCore platform for building voice AI agents with Amazon Nova 2 Sonic

Comments