Product update · Amazon Web Services

Amazon Bedrock AgentCore Evaluations now generally available for testing AI agents

TL;DR

Amazon Bedrock AgentCore Evaluations, a fully managed service for assessing AI agent performance, is now generally available following its public preview debut at AWS re:Invent 2025. The service addresses the core challenge that LLMs are non-deterministic—the same user query can produce different tool selections and outputs across runs—making traditional single-pass testing inadequate for reliable agent deployment.


Amazon Bedrock AgentCore Evaluations Now Generally Available

Amazon Web Services has announced general availability of Amazon Bedrock AgentCore Evaluations, a fully managed service for measuring AI agent performance across the development lifecycle. The service, which entered public preview at AWS re:Invent 2025, addresses a fundamental gap in agent testing: the inability to reliably assess non-deterministic systems at scale.

The Core Problem: Non-Deterministic Agent Behavior

Traditional software testing assumes deterministic outputs—the same input produces the same output every time. LLM-based agents violate this assumption entirely. The same user query can trigger different tool selections, reasoning paths, and responses across multiple runs. This means a single test pass reveals what can happen, not what typically happens.

Without systematic measurement across these variations, teams resort to manual testing cycles and reactive debugging—burning API costs without clear visibility into whether changes actually improve performance. Every prompt modification becomes risky when you cannot quantify its impact.
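The single-pass problem is easy to make concrete with a toy simulation. The "agent" below is a hypothetical stand-in that picks a tool stochastically, the way an LLM's choices vary run to run; nothing here is AgentCore code:

```python
import random

random.seed(7)  # seeded only so the demo is reproducible

def agent_pick_tool(query):
    # Hypothetical stand-in for an LLM's tool choice, not a real agent call.
    # The correct tool for an arithmetic query would be the calculator.
    return random.choices(["calculator", "web_search"], weights=[0.8, 0.2])[0]

# One run tells you what *can* happen; many runs show what *typically* happens.
runs = [agent_pick_tool("what is 17 * 23?") for _ in range(200)]
rate = runs.count("calculator") / len(runs)
print(f"calculator chosen in {rate:.0%} of runs")
```

A single test pass here would report either success or failure with no indication that the behavior flips roughly one time in five, which is exactly the visibility gap systematic measurement closes.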

How AgentCore Evaluations Works

The service operates on three core principles:

Evidence-driven development: Replaces intuition with quantitative metrics, enabling teams to measure the actual impact of changes rather than debating whether modifications "feel better."

Multi-dimensional assessment: Evaluates different aspects of agent behavior independently—tool selection accuracy, parameter correctness, response quality, and user experience—rather than relying on a single aggregate score.

Continuous measurement: Connects development baselines directly to production monitoring, ensuring quality holds as real-world conditions evolve.
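A toy illustration of why the multi-dimensional principle matters: the dimension names below come from the list above, but the scores are invented. A single aggregate can look acceptable while one dimension is failing badly:

```python
# Invented per-dimension scores for one agent run; only the dimension
# names come from the article, the numbers are made up for illustration.
scores = {
    "tool_selection": 0.95,
    "parameter_correctness": 0.40,  # the failing dimension
    "response_quality": 0.90,
    "user_experience": 0.85,
}

aggregate = sum(scores.values()) / len(scores)
worst_dimension = min(scores, key=scores.get)

print(f"aggregate={aggregate:.2f}")           # 0.78 looks fine in isolation
print(f"weakest dimension: {worst_dimension}")
```

Reporting each dimension independently surfaces `parameter_correctness` as the thing to fix; the aggregate alone would have hidden it.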

Three Evaluation Approaches

AgentCore Evaluations supports three configuration methods:

  1. LLM-as-a-Judge: An LLM evaluates agent interactions against structured rubrics, examining conversation history, available tools, tool calls, parameters, and system instructions. Each score includes detailed reasoning and explanations.

  2. Ground Truth evaluation: Compares agent responses against pre-defined or simulated datasets for deterministic validation.

  3. Custom code evaluators: Users can bring AWS Lambda functions with proprietary evaluation logic.
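To illustrate the third option, a custom evaluator is essentially a Lambda handler that returns a score plus an explanation. The event shape below is an assumption made for this sketch, not the documented AgentCore payload contract:

```python
def handler(event, context=None):
    """Hypothetical custom evaluator: did the agent call the expected tool?

    The event fields ("expected_tool", "tool_calls") are invented for this
    sketch; consult the AgentCore Evaluations docs for the real payload.
    """
    expected = event.get("expected_tool")
    called = [call["name"] for call in event.get("tool_calls", [])]

    score = 1.0 if expected in called else 0.0
    return {
        "score": score,
        "explanation": f"expected {expected!r}, agent called {called}",
    }
```

The same handler structure could encode any proprietary logic, such as checking that a refund amount never exceeds a policy limit, which is exactly what rubric-based LLM judging cannot guarantee deterministically.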

Technical Foundation and Compatibility

The service builds on OpenTelemetry (OTEL) traces with generative AI semantic conventions, an open observability standard extended with fields specific to LLM interactions including prompts, completions, tool calls, and model parameters. This standardized approach enables AgentCore Evaluations to work consistently across agents built with Strands Agents, LangGraph, and any system instrumented with OpenTelemetry and OpenInference.
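To make the trace format concrete, here is a stdlib-only sketch of how an evaluator might pull tool calls out of spans carrying `gen_ai.*` attributes. The span dictionaries are hand-written stand-ins for what an instrumented agent framework would emit, but the attribute names follow the OpenTelemetry generative-AI semantic conventions:

```python
def extract_tool_calls(spans):
    """Return the tool names invoked across a trace, in span order."""
    return [
        span["attributes"]["gen_ai.tool.name"]
        for span in spans
        if span["attributes"].get("gen_ai.operation.name") == "execute_tool"
    ]

# Invented spans mimicking an instrumented agent's OTEL output.
spans = [
    {"name": "chat", "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "example-model",   # hypothetical model id
        "gen_ai.usage.input_tokens": 412,
    }},
    {"name": "tool", "attributes": {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "search_catalog",      # hypothetical tool name
    }},
]

print(extract_tool_calls(spans))  # → ['search_catalog']
```

Because the attribute names are standardized rather than framework-specific, the same extraction logic works whether the spans came from Strands Agents, LangGraph, or hand-rolled instrumentation.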

Amazon fully manages model quotas and inference capacity for built-in evaluators, meaning organizations evaluating multiple agents do not consume their own API quotas or require separate evaluation infrastructure.

Lifecycle Coverage: Development to Production

The service supports two distinct evaluation phases:

Development: Controlled environments for comparing alternatives, testing on curated datasets, reproducing results, and validating changes before deployment.

Production: Real-world interaction monitoring at scale, including shadow testing, A/B testing, and continuous performance tracking as users encounter unanticipated edge cases.

This dual approach addresses the gap between agent behavior in demos and controlled tests versus actual performance when exposed to production traffic and real user patterns.
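Production A/B and shadow evaluation both depend on stable traffic splitting. One common generic pattern (not an AgentCore API) is deterministic hash bucketing on the session id, so each session consistently hits the same agent variant across requests:

```python
import hashlib

def assign_variant(session_id: str, candidate_share: float = 0.1) -> str:
    """Deterministically bucket a session into 'candidate' or 'baseline'.

    Hashing the session id keeps a user on one variant for the whole
    session. This is a generic traffic-splitting sketch, not AgentCore code.
    """
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 1000
    return "candidate" if bucket < candidate_share * 1000 else "baseline"

print(assign_variant("session-42"))
```

With evaluation scores grouped by variant, the comparison between the current agent and a candidate happens on real traffic rather than curated test cases.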

What This Means

AgentCore Evaluations shifts agent development from intuition-based iteration to data-driven validation. For organizations deploying multiple agents, the service eliminates the overhead of building custom evaluation infrastructure, which previously consumed more engineering effort than acting on the evaluation results it produced.

The emphasis on multi-dimensional scoring and transparent reasoning (especially in LLM-as-a-Judge mode) allows teams to pinpoint failure modes precisely rather than receiving opaque pass/fail verdicts. For teams building production agents, this means confidence in deployment decisions backed by quantitative evidence rather than manual spot-checking.

However, the service's value depends entirely on how well teams define their evaluation criteria upfront; poorly defined rubrics optimize for the wrong outcomes. The continuous evaluation cycle, in which failures become new test cases, means evaluation quality improves iteratively, but only if teams actively use the scoring explanations to refine their test datasets and success criteria.

