Amazon Bedrock AgentCore Evaluations now generally available for testing AI agents
Amazon Web Services has announced general availability of Amazon Bedrock AgentCore Evaluations, a fully managed service for measuring AI agent performance across the development lifecycle. The service, which entered public preview at AWS re:Invent 2025, addresses a fundamental gap in agent testing: LLMs are non-deterministic, so the same user query can produce different tool selections and outputs across runs. That makes traditional single-pass testing inadequate for assessing agents reliably at scale.
The Core Problem: Non-Deterministic Agent Behavior
Traditional software testing assumes deterministic outputs—the same input produces the same output every time. LLM-based agents violate this assumption entirely. The same user query can trigger different tool selections, reasoning paths, and responses across multiple runs. This means a single test pass reveals what can happen, not what typically happens.
Without systematic measurement across these variations, teams resort to manual testing cycles and reactive debugging—burning API costs without clear visibility into whether changes actually improve performance. Every prompt modification becomes risky when you cannot quantify its impact.
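To make the "can happen" versus "typically happens" distinction concrete, here is a minimal sketch that runs the same query many times and reports how often the expected tool is chosen. The `flaky_agent` function, its tool names, and its 80% success rate are hypothetical stand-ins for a real agent, not part of AgentCore Evaluations.

```python
import random
from collections import Counter


def flaky_agent(query: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an agent; returns the name of the tool it chose."""
    # Assumption for illustration: the right tool is picked about 80% of the time.
    return "search_orders" if rng.random() < 0.8 else "search_products"


def tool_selection_rate(query: str, expected_tool: str, runs: int, seed: int = 0):
    """Run the same query `runs` times and measure how often the expected tool is chosen."""
    rng = random.Random(seed)  # seeded so the experiment itself is reproducible
    choices = Counter(flaky_agent(query, rng) for _ in range(runs))
    return choices[expected_tool] / runs, choices


rate, choices = tool_selection_rate("Where is my order?", "search_orders", runs=200)
print(f"expected tool chosen in {rate:.0%} of runs: {dict(choices)}")
```

A single run of `flaky_agent` would report either a pass or a fail; only the aggregate over many runs shows how the agent typically behaves.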
How AgentCore Evaluations Works
The service operates on three core principles:
Evidence-driven development: Replaces intuition with quantitative metrics, enabling teams to measure actual impact of changes rather than debating whether modifications "feel better."
Multi-dimensional assessment: Evaluates different aspects of agent behavior independently—tool selection accuracy, parameter correctness, response quality, and user experience—rather than relying on a single aggregate score.
Continuous measurement: Connects development baselines directly to production monitoring, ensuring quality holds as real-world conditions evolve.
Three Evaluation Approaches
AgentCore Evaluations supports three configuration methods:
- LLM-as-a-Judge: An LLM evaluates agent interactions against structured rubrics, examining conversation history, available tools, tool calls, parameters, and system instructions. Each score includes detailed reasoning and explanations.
- Ground Truth evaluation: Compares agent responses against pre-defined or simulated datasets for deterministic validation.
- Custom code evaluators: Users can bring AWS Lambda functions with proprietary evaluation logic.
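As a rough illustration of what a custom code evaluator with ground-truth comparison might look like, the sketch below scores tool selection and parameter correctness separately inside a Lambda handler. `lambda_handler(event, context)` is the usual Python entry point for AWS Lambda, but the event fields (`expected_tool_calls`, `actual_tool_calls`) and the response shape are assumptions made for this example, not the contract the service defines.

```python
def score_tool_calls(expected: list, actual: list) -> dict:
    """Compare tool calls position by position; each call is {"name": str, "arguments": dict}."""
    denom = max(len(expected), len(actual))
    if denom == 0:
        # Nothing was expected and nothing was called: treat as a perfect match.
        return {"tool_selection_accuracy": 1.0, "parameter_correctness": 1.0}
    pairs = list(zip(expected, actual))
    tool_hits = sum(e["name"] == a["name"] for e, a in pairs)
    param_hits = sum(
        e["name"] == a["name"] and e["arguments"] == a["arguments"] for e, a in pairs
    )
    # Missing or extra calls lower both scores because denom is the longer length.
    return {
        "tool_selection_accuracy": tool_hits / denom,
        "parameter_correctness": param_hits / denom,
    }


def lambda_handler(event, context):
    """Hypothetical evaluator entry point; the event and return shapes are assumed."""
    scores = score_tool_calls(event["expected_tool_calls"], event["actual_tool_calls"])
    explanation = ", ".join(f"{name}={value:.2f}" for name, value in scores.items())
    return {"scores": scores, "explanation": explanation}
```

Keeping the two scores separate means an agent that picks the right tool but passes the wrong arguments shows up as a parameter problem rather than a generic failure.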
Technical Foundation and Compatibility
The service builds on OpenTelemetry (OTEL) traces with generative AI semantic conventions, an open observability standard extended with fields specific to LLM interactions including prompts, completions, tool calls, and model parameters. This standardized approach enables AgentCore Evaluations to work consistently across agents built with Strands Agents, LangGraph, and any system instrumented with OpenTelemetry and OpenInference.
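The sketch below shows the kind of information such traces carry, using plain dictionaries rather than a real OpenTelemetry SDK. The attribute keys follow the OpenTelemetry generative AI semantic conventions as commonly documented (for example `gen_ai.operation.name` and `gen_ai.tool.name`), but those conventions are still evolving, so treat the exact names as assumptions to check against the current specification. The span contents, model name, and tool name are invented for illustration.

```python
from typing import Any

# Hypothetical, simplified spans from one agent turn (not real exporter output).
SPANS: list[dict[str, Any]] = [
    {"name": "chat", "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 412,
        "gen_ai.usage.output_tokens": 58,
    }},
    {"name": "execute_tool lookup_order", "attributes": {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "lookup_order",
    }},
    {"name": "chat", "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 530,
        "gen_ai.usage.output_tokens": 41,
    }},
]


def tool_sequence(spans: list[dict[str, Any]]) -> list[str]:
    """Return the tools the agent invoked, in trace order."""
    return [
        s["attributes"]["gen_ai.tool.name"]
        for s in spans
        if s["attributes"].get("gen_ai.operation.name") == "execute_tool"
    ]


def total_tokens(spans: list[dict[str, Any]]) -> int:
    """Sum input and output token counts across all spans that report them."""
    keys = ("gen_ai.usage.input_tokens", "gen_ai.usage.output_tokens")
    return sum(s["attributes"].get(k, 0) for s in spans for k in keys)


print(tool_sequence(SPANS), total_tokens(SPANS))
```

Because the fields are standardized, the same two helper functions would work on traces from any framework that emits these attributes, which is the portability argument the article makes.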
Amazon fully manages model quotas and inference capacity for built-in evaluators, meaning organizations evaluating multiple agents do not consume their own API quotas or require separate evaluation infrastructure.
Lifecycle Coverage: Development to Production
The service supports two distinct evaluation phases:
Development: Controlled environments for comparing alternatives, testing on curated datasets, reproducing results, and validating changes before deployment.
Production: Real-world interaction monitoring at scale, including shadow testing, A/B testing, and continuous performance tracking as users encounter unanticipated edge cases.
This dual approach addresses the gap between agent behavior in demos and controlled tests versus actual performance when exposed to production traffic and real user patterns.
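One simple way to connect a development baseline to production monitoring is to flag a regression when the average production score falls below the baseline by more than a tolerance. The sketch below assumes scores in the range 0 to 1; the function name and the 0.05 default tolerance are illustrative choices, not anything the service prescribes.

```python
from statistics import mean


def has_regressed(baseline_scores, production_scores, tolerance=0.05):
    """Return True when the production average drops more than `tolerance` below baseline."""
    return mean(production_scores) < mean(baseline_scores) - tolerance


# Scores from a curated development dataset versus a sample of production traffic.
baseline = [0.92, 0.88, 0.90]
production = [0.81, 0.79, 0.84]
print(has_regressed(baseline, production))
```

A comparison of means is deliberately crude. With small samples a team would likely want a statistical test or confidence interval before acting on the signal.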
What This Means
AgentCore Evaluations shifts agent development from intuition-based iteration to data-driven validation. For organizations deploying multiple agents, the service removes the infrastructure overhead of building custom evaluation systems, work that could previously consume more engineering effort than acting on the evaluation results themselves.
The emphasis on multi-dimensional scoring and transparent reasoning (especially in LLM-as-a-Judge mode) allows teams to pinpoint failure modes precisely rather than receiving opaque pass/fail verdicts. For teams building production agents, this means confidence in deployment decisions backed by quantitative evidence rather than manual spot-checking.
However, the service's value depends entirely on how well teams define their evaluation criteria upfront. Poorly defined rubrics steer optimization toward the wrong outcomes. Because failures become new test cases in a continuous evaluation cycle, evaluation quality improves iteratively, but only if teams actively use the scoring explanations to refine their test datasets and success criteria.