product update · Amazon Web Services

Amazon Bedrock AgentCore Evaluations now generally available for testing AI agents

TL;DR

Amazon Bedrock AgentCore Evaluations, a fully managed service for assessing AI agent performance, is now generally available following its public preview debut at AWS re:Invent 2025. The service addresses the core challenge that LLMs are non-deterministic—the same user query can produce different tool selections and outputs across runs—making traditional single-pass testing inadequate for reliable agent deployment.



Amazon Web Services has announced general availability of Amazon Bedrock AgentCore Evaluations, a fully managed service for measuring AI agent performance across the development lifecycle. The service, which entered public preview at AWS re:Invent 2025, addresses a fundamental gap in agent testing: the inability to reliably assess non-deterministic systems at scale.

The Core Problem: Non-Deterministic Agent Behavior

Traditional software testing assumes deterministic outputs—the same input produces the same output every time. LLM-based agents violate this assumption entirely. The same user query can trigger different tool selections, reasoning paths, and responses across multiple runs. This means a single test pass reveals what can happen, not what typically happens.

Without systematic measurement across these variations, teams resort to manual testing cycles and reactive debugging—burning API costs without clear visibility into whether changes actually improve performance. Every prompt modification becomes risky when you cannot quantify its impact.
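To make the point concrete, here is a minimal sketch of measuring a pass rate across repeated runs instead of trusting a single test pass. The `call_agent` function is a hypothetical stand-in that simulates non-deterministic tool selection; a real agent call would replace it:

```python
import random

def call_agent(query: str, seed: int) -> str:
    """Stand-in for a non-deterministic LLM agent: the same query can
    yield different tool selections across runs (simulated here)."""
    rng = random.Random(seed)
    return rng.choice(["search_tool", "search_tool", "calculator"])

def pass_rate(query: str, expected_tool: str, runs: int = 20) -> float:
    """Run the same query many times and report how often the agent
    picks the expected tool, rather than trusting one test pass."""
    hits = sum(call_agent(query, seed=i) == expected_tool for i in range(runs))
    return hits / runs

rate = pass_rate("What is 2 + 2?", expected_tool="calculator")
print(f"expected tool chosen in {rate:.0%} of runs")
```

A single passing run would hide the fact that the agent picks the wrong tool some fraction of the time; the aggregate rate makes that fraction visible.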

How AgentCore Evaluations Works

The service operates on three core principles:

Evidence-driven development: Replaces intuition with quantitative metrics, enabling teams to measure the actual impact of changes rather than debating whether modifications "feel better."

Multi-dimensional assessment: Evaluates different aspects of agent behavior independently—tool selection accuracy, parameter correctness, response quality, and user experience—rather than relying on a single aggregate score.

Continuous measurement: Connects development baselines directly to production monitoring, ensuring quality holds as real-world conditions evolve.

Three Evaluation Approaches

AgentCore Evaluations supports three configuration methods:

  1. LLM-as-a-Judge: An LLM evaluates agent interactions against structured rubrics, examining conversation history, available tools, tool calls, parameters, and system instructions. Each score includes detailed reasoning and explanations.

  2. Ground Truth evaluation: Compares agent responses against pre-defined or simulated datasets for deterministic validation.

  3. Custom code evaluators: Users can bring AWS Lambda functions with proprietary evaluation logic.
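As an illustration of the third approach, a custom Lambda evaluator might look like the sketch below. The event shape and return format are assumptions for illustration only, not the documented AgentCore Evaluations contract:

```python
# Hypothetical Lambda-based custom evaluator. The "agent_response" field
# and the score/explanation return shape are illustrative assumptions,
# not the documented AgentCore Evaluations interface.
def lambda_handler(event, context):
    response = event.get("agent_response", "")
    # Proprietary check: the response must cite at least one source marker.
    cited = "[source:" in response
    score = 1.0 if cited else 0.0
    return {
        "score": score,
        "explanation": "response cites a source" if cited
                       else "no source citation found",
    }
```

The point of the Lambda option is exactly this kind of organization-specific rule, which neither a generic LLM judge nor a ground-truth dataset would encode.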

Technical Foundation and Compatibility

The service builds on OpenTelemetry (OTel) traces that follow the generative AI semantic conventions, which extend the open observability standard with fields specific to LLM interactions: prompts, completions, tool calls, and model parameters. This standardized approach enables AgentCore Evaluations to work consistently across agents built with Strands Agents, LangGraph, and any system instrumented with OpenTelemetry and OpenInference.
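For a sense of what these traces carry, the GenAI semantic conventions define attribute names like the following. The span is modeled as a plain dict here rather than via the OTel SDK for brevity; the attribute names follow the published convention, while the values are invented:

```python
# Sketch of an LLM-call span with GenAI semantic-convention attributes.
# Attribute names come from the OpenTelemetry GenAI conventions; the
# values and model name are placeholders.
llm_span = {
    "name": "chat",
    "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "example-model",
        "gen_ai.request.temperature": 0.2,    # model parameters
        "gen_ai.usage.input_tokens": 42,      # token accounting
        "gen_ai.usage.output_tokens": 17,
    },
}
```

Because any OTel-instrumented framework emits spans with these shared attribute names, an evaluator can score tool calls and responses without framework-specific adapters.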

Amazon fully manages model quotas and inference capacity for built-in evaluators, meaning organizations evaluating multiple agents do not consume their own API quotas or require separate evaluation infrastructure.

Lifecycle Coverage: Development to Production

The service supports two distinct evaluation phases:

Development: Controlled environments for comparing alternatives, testing on curated datasets, reproducing results, and validating changes before deployment.

Production: Real-world interaction monitoring at scale, including shadow testing, A/B testing, and continuous performance tracking as users encounter unanticipated edge cases.

This dual approach addresses the gap between agent behavior in demos and controlled tests versus actual performance when exposed to production traffic and real user patterns.
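Shadow testing, one of the production techniques mentioned above, can be sketched as running a candidate agent on a copy of live traffic while serving only the production agent's answers. Both agent functions below are hypothetical stand-ins:

```python
def prod_agent(query: str) -> str:
    """Current production agent (stand-in)."""
    return query.upper()

def candidate_agent(query: str) -> str:
    """New version under test (stand-in that diverges on longer queries)."""
    return query.upper() if len(query) < 5 else query.lower()

def shadow_test(queries: list[str]) -> float:
    """Shadow testing: the candidate sees a copy of live traffic, but
    only the production agent's answer is ever returned to the user.
    Returns the fraction of queries where the two agents disagree."""
    disagreements = 0
    for q in queries:
        served = prod_agent(q)       # user sees this
        shadow = candidate_agent(q)  # logged for comparison, never served
        if served != shadow:
            disagreements += 1
    return disagreements / len(queries)

rate = shadow_test(["hi", "hello world"])
print(f"candidate disagrees on {rate:.0%} of traffic")
```

A low disagreement rate on real traffic gives evidence the candidate is safe to promote; a high one flags exactly the production edge cases that curated development datasets tend to miss.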

What This Means

AgentCore Evaluations shifts agent development from intuition-based iteration to data-driven validation. For organizations deploying multiple agents, the service eliminates the infrastructure overhead of building custom evaluation systems, an undertaking that often consumed more engineering effort than acting on the evaluation results themselves.

The emphasis on multi-dimensional scoring and transparent reasoning (especially in LLM-as-a-Judge mode) allows teams to pinpoint failure modes precisely rather than receiving opaque pass/fail verdicts. For teams building production agents, this means confidence in deployment decisions backed by quantitative evidence rather than manual spot-checking.

However, the service's value depends entirely on how well teams define their evaluation criteria upfront. Poorly defined rubrics optimize for the wrong outcomes. The continuous evaluation cycle, in which failures become new test cases, means evaluation quality improves iteratively, but only if teams actively use the scoring explanations to refine their test datasets and success criteria.
