Loka Achieves 87% Speech Reasoning Accuracy Using Amazon Nova 2 Sonic, Outperforming GPT Realtime and Gemini

TL;DR

Loka built a conversational voice agent using Amazon Nova 2 Sonic that achieved 87.0% speech reasoning accuracy on Big Bench Audio, surpassing GPT Realtime at 83.0% and Gemini 2.5 Flash Native Audio at 71.0%. The system delivers Time to First Audio of 1.39 seconds at approximately $0.27 per hour of input audio.

Loka, working with Amazon Web Services, has deployed a voice AI agent using Amazon Nova 2 Sonic that scored 87.0% on the Big Bench Audio speech reasoning benchmark, according to AWS. This outperformed GPT Realtime at 83.0% and Gemini 2.5 Flash Native Audio (Live API) at 71.0%.

Performance Metrics

The implementation achieved a Time to First Audio of 1.39 seconds, which AWS states is low enough to support natural interruptions in conversation. Pricing runs approximately $0.27 per hour of input audio processed, according to AWS documentation at the time of publication.

The system uses native speech-to-speech processing, bypassing the traditional three-stage pipeline of Speech-to-Text, LLM processing, and Text-to-Speech. In traditional systems, that multi-stage architecture typically introduces response delays of 3 to 5 seconds.

Evaluation Methodology

Loka built an automated evaluation pipeline that uses LLM-as-a-judge scoring across five dimensions on a 1-5 scale. Reported scores, from the baseline to Nova 2 Sonic:

  • Response Appropriateness: 2.5 to 2.9
  • Intent Understanding: 2.9 to 3.0
  • Completeness: 1.8 to 2.5 (+0.7, largest gain)
  • Conversational Naturalness: 2.5 to 2.8
  • Overall Score: 2.4 to 2.7

After two iterations of prompt engineering, the team raised the overall score to 3.8 out of 5.0, up from Nova 2 Sonic's initial 2.7.
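The gains implied by these figures can be checked with a few lines of arithmetic. The following is a minimal sketch that only re-computes deltas from the scores quoted in this article; the dictionary layout and function name are illustrative assumptions, not part of Loka's evaluation pipeline.

```python
# Scores quoted in the article on a 1-5 scale: (baseline, Nova 2 Sonic).
scores = {
    "Response Appropriateness": (2.5, 2.9),
    "Intent Understanding": (2.9, 3.0),
    "Completeness": (1.8, 2.5),
    "Conversational Naturalness": (2.5, 2.8),
    "Overall Score": (2.4, 2.7),
}


def gains(table):
    """Return the per-dimension gain, rounded to one decimal place."""
    return {name: round(after - before, 1) for name, (before, after) in table.items()}


deltas = gains(scores)
largest = max(deltas, key=deltas.get)

# Overall score before and after the two prompt-engineering iterations.
prompt_gain = round(3.8 - 2.7, 1)

for name, delta in deltas.items():
    print(f"{name}: +{delta}")
print(f"Largest gain: {largest}")
print(f"Overall gain from prompt engineering: +{prompt_gain}")
```

On these numbers, the two prompt-engineering iterations account for a larger improvement in the overall score than the move from the baseline to Nova 2 Sonic did.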

Technical Architecture

The system streams audio directly to the model, preserving tone, emotion, and timing information that text-only pipelines lose. Loka used Amazon Bedrock Prompt Management to version-control prompt templates, each identified by a unique ARN, so that prompt changes can be deployed without changing application code.

The team created templatized prompts with variables like {assistant_name} and {dealership_address} to enable multi-tenant deployment. AWS IAM controls govern who can author, approve, or deploy prompt changes.
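To illustrate the templating idea, here is a minimal sketch in plain Python. It does not call Amazon Bedrock Prompt Management; it only shows how one template with the two variables named above could be filled per tenant. The template wording, the tenant values, and the helper names are hypothetical.

```python
import string

# Hypothetical template text; only the two variable names come from the article.
PROMPT_TEMPLATE = (
    "You are {assistant_name}, a voice assistant for a car dealership "
    "located at {dealership_address}. Keep answers short and conversational."
)


def required_variables(template: str) -> set:
    """Collect the placeholder names that the template expects."""
    return {field for _, field, _, _ in string.Formatter().parse(template) if field}


def render_prompt(template: str, tenant: dict) -> str:
    """Fill the template for one tenant, failing loudly on missing values."""
    missing = required_variables(template) - set(tenant)
    if missing:
        raise KeyError(f"missing template variables: {sorted(missing)}")
    return template.format(**tenant)


# Hypothetical tenant configuration for one dealership.
tenant = {"assistant_name": "Ava", "dealership_address": "123 Example Street"}
print(render_prompt(PROMPT_TEMPLATE, tenant))
```

Checking for missing variables before rendering keeps a misconfigured tenant from reaching callers with a half-filled prompt.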

Use Case: Automotive Dealership Voice Agents

The deployment targets automotive dealerships handling customer inquiries. Example scenarios include parsing multi-part requests: "I'm looking for that SUV you advertised, but not the hybrid one. I can only come in after 5 PM."

Traditional systems struggled with such requests because speech-to-text conversion loses crucial context like tone, hesitation, and urgency. The 3-5 second delays in legacy systems proved particularly problematic in sales contexts where immediate responses matter.

Cost and Scale Considerations

AWS claims the $0.27 per hour pricing makes the system viable for serving thousands of dealership locations. Traditional real-time voice systems became cost-prohibitive at scale when processing continuous audio streams.
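A rough sense of scale can be had from a back-of-the-envelope estimate. The sketch below uses the $0.27 per input-audio hour figure quoted in this article; the number of locations, the daily audio hours per location, and the 30-day month are hypothetical assumptions, and the estimate covers input audio only.

```python
# Price per hour of input audio, as quoted in the article.
PRICE_PER_INPUT_HOUR_USD = 0.27


def monthly_input_cost(locations: int, audio_hours_per_day: float, days: int = 30) -> float:
    """Estimate the monthly input-audio cost in USD for a fleet of locations."""
    total_hours = locations * audio_hours_per_day * days
    return round(total_hours * PRICE_PER_INPUT_HOUR_USD, 2)


# Hypothetical workload: 1,000 dealerships, 4 hours of caller audio per day each.
print(monthly_input_cost(locations=1000, audio_hours_per_day=4))
```

Under those assumed volumes the input-audio charge comes to $32,400 per month, or about $32 per location.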

What This Means

This case study provides concrete benchmark data showing native speech-to-speech models can outperform traditional pipelines on reasoning tasks while reducing latency and cost. The 87% Big Bench Audio score demonstrates that end-to-end audio processing doesn't sacrifice intelligence for speed. However, the overall quality scores of 2.7 to 3.8 out of 5.0 suggest significant room for improvement before these systems match human-level conversation quality. The economic case becomes compelling primarily for high-volume deployments where per-hour costs matter more than absolute quality.
