AWS launches dataset management in Bedrock AgentCore for versioned agent test suites

TL;DR

Amazon Web Services introduced dataset management in Bedrock AgentCore, enabling developers to build versioned test suites with immutable baselines for agent evaluation. The feature supports predefined scenarios with ground truth assertions and user simulation scenarios where LLM-backed actors conduct multi-turn conversations.

May 28, 2026 · 6:20 PM


Two scenario types for different testing needs

The system handles two distinct schema types. Predefined scenarios capture specific inputs, expected outputs, tool sequences, and assertions that must hold across runs. According to AWS, these function as backward-looking tests that formalize known failures into permanent test cases.
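As a rough illustration, a predefined scenario can be pictured as a small immutable record. The sketch below is not the AgentCore schema, which the announcement summary does not spell out; the class name, field names, and tool names are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PredefinedScenario:
    """Hypothetical record for one scripted test case (not the AgentCore schema)."""
    scenario_id: str
    user_input: str
    expected_output: str
    expected_tools: tuple[str, ...]  # tool calls, in the order they should occur
    assertions: tuple[str, ...]      # statements that must hold on every run


def to_jsonl(scenarios) -> str:
    """Serialize scenarios one per line so a dataset can be stored and diffed."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in scenarios)


# Example modeled on the broker scenario described in the article.
# The tool names are invented for illustration.
scenario = PredefinedScenario(
    scenario_id="broker-identification",
    user_input="Hi, I'm a broker covering the energy sector.",
    expected_output="energy",
    expected_tools=("identify_broker", "store_sector_preferences"),
    assertions=("The stored sector preference is 'energy'.",),
)
```

Freezing the record mirrors the idea that a known failure, once formalized, should not drift between runs.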

User simulation scenarios take a different approach. Instead of scripted turns, developers define an actor persona with traits, context, and goals. An LLM-backed actor then drives a real multi-turn conversation with the agent until the conversation completes or a turn limit is reached. AWS states this tests whether an agent can satisfy a type of user across any path that user takes, not just handle specific inputs.
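The control flow of such a simulation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AWS's implementation: the `Persona` fields, the `simulate` function, and both stubs are hypothetical, and the stubs stand in for the LLM calls a real actor and agent would make.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona definition: who the simulated user is and what they want."""
    name: str
    traits: tuple[str, ...]
    context: str
    goal: str


# An actor maps (persona, transcript) to the next user message,
# or None once it considers the goal met.
Actor = Callable[[Persona, list], Optional[str]]
Agent = Callable[[str], str]


def simulate(persona: Persona, actor: Actor, agent: Agent, max_turns: int = 5):
    """Run the conversation until the actor is satisfied or the turn limit is hit."""
    transcript = []
    for _ in range(max_turns):
        user_msg = actor(persona, transcript)
        if user_msg is None:
            return transcript, "completed"
        transcript.append(("user", user_msg))
        transcript.append(("agent", agent(user_msg)))
    return transcript, "turn_limit"


def stub_actor(persona, transcript):
    # Stand-in for an LLM call: push back until a reply cites a source.
    if transcript and "source:" in transcript[-1][1]:
        return None
    if not transcript:
        return "Compare NVIDIA and AMD."
    return "That is thin. Please cite a source."


def stub_agent(message):
    # Stand-in for the agent under test; the reply text is placeholder content.
    if "cite" in message.lower():
        return "Here is a comparison (source: placeholder)."
    return "Both look fine."


analyst = Persona("senior tech analyst", ("skeptical",), "covers semiconductors",
                  "get citable analysis on NVIDIA versus AMD")
```

Because the actor decides each turn from the transcript so far, the same persona can reach its goal along different paths, which is the property the scripted format cannot exercise.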

Versioning solves the moving baseline problem

The core issue addressed is measurement consistency. Agents are non-deterministic by design: the same input can produce different outputs. Without fixed test inputs, developers cannot tell whether a score changed because the agent improved or because of different model sampling.

Datasets support two workflows. The inner loop operates at developer desk speed, measured in minutes. Developers iterate on a mutable draft dataset, curating production failures and adjusting test cases. The outer loop runs in CI/CD pipelines. Teams publish numbered versions of datasets that become immutable checkpoints. Each pipeline run evaluates against the same locked inputs with identical ground truth assertions.
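The draft-versus-published split can be illustrated with a small in-memory store. This is a sketch of the general idea only: `DatasetStore` and its methods are hypothetical names, not part of any AWS SDK, and the service's own storage model is not described in the summary.

```python
import hashlib
import json


class DatasetStore:
    """Hypothetical store: one mutable draft plus numbered, immutable versions."""

    def __init__(self):
        self.draft = []      # inner loop: edited freely at the developer's desk
        self._versions = {}  # version number -> (serialized cases, digest)

    def publish(self) -> int:
        """Freeze the current draft as the next numbered version."""
        version = len(self._versions) + 1
        payload = json.dumps(self.draft, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self._versions[version] = (payload, digest)
        return version

    def load(self, version: int) -> list:
        """Return a fresh copy, so callers cannot alter the stored version."""
        payload, _ = self._versions[version]
        return json.loads(payload)

    def digest(self, version: int) -> str:
        """Content hash a pipeline could log to show which inputs it ran against."""
        return self._versions[version][1]
```

Storing each version as a serialized string, rather than a live list, is one simple way to guarantee that later edits to the draft cannot leak into a published checkpoint.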

AWS provided a financial market intelligence agent as a reference implementation. The agent serves investment brokers, retrieves stock prices, searches Bloomberg and Reuters, and maintains conversation state through Bedrock Memory. A predefined scenario might verify that the agent correctly identifies a broker and stores sector preferences. A simulated scenario defines a senior tech analyst persona who probes for citable analysis on NVIDIA versus AMD, pushing back on thin responses until satisfied.

Ground truth distinguishes correctness from appearance

According to AWS, LLM judges can assess whether responses sound helpful but cannot verify factual accuracy, correct tool execution order, or PII isolation. Ground truth assertions make these checks explicit. Without them, evaluation measures the appearance of correctness rather than correctness itself.
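The three kinds of checks named above are deterministic, which is what separates them from a judge's opinion. The functions below are a minimal sketch with hypothetical names; they show one plausible way to write such checks and are not the assertion mechanism AgentCore provides.

```python
import re


def tools_in_expected_order(expected, actual) -> bool:
    """True if `expected` occurs within `actual` as an ordered subsequence."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the match,
    # so later tools must appear after earlier ones.
    return all(tool in remaining for tool in expected)


def leaked_values(response: str, forbidden_values) -> list:
    """Return any forbidden strings (e.g. another client's name) found in the response."""
    lowered = response.lower()
    return [v for v in forbidden_values if v.lower() in lowered]


def states_value(response: str, expected: float, tolerance: float = 0.01) -> bool:
    """True if the response contains a number within `tolerance` of `expected`.

    Simplification: numbers with thousands separators are not handled.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", response)]
    return any(abs(n - expected) <= tolerance for n in numbers)
```

A response can read as helpful and still fail all three checks, which is the gap between the appearance of correctness and correctness itself.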

The feature integrates directly with existing Bedrock AgentCore evaluation infrastructure. Developers author scenarios with expected trajectories and assertions, publish them as immutable versions, run evaluations, and confirm improvements against the same locked inputs.
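The final step, confirming an improvement against the same locked inputs, amounts to comparing two runs that share a dataset version. The sketch below assumes a hypothetical `EvalRun` record and is not tied to the AgentCore evaluation API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """Hypothetical summary of one evaluation run."""
    agent_build: str
    dataset_version: int
    passed: frozenset  # ids of the scenarios that passed


def regressions(baseline: EvalRun, candidate: EvalRun) -> list:
    """List scenarios that passed in the baseline but not in the candidate.

    Refuses to compare runs made against different dataset versions,
    because their scores were measured on different inputs.
    """
    if baseline.dataset_version != candidate.dataset_version:
        raise ValueError("runs used different dataset versions; not comparable")
    return sorted(baseline.passed - candidate.passed)
```

A CI gate built this way would fail on a non-empty list and would error out, rather than pass silently, if someone changed the test inputs between runs.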

What this means

This addresses a genuine gap in agent development tooling. Most teams have CI/CD gates for agent changes but lack stable, versioned test fixtures underneath. As a result, a pipeline's outcome can shift because the test questions changed, so a passing build does not show that regressions were caught. Combining versioned datasets with ground truth assertions and user simulation gives developers a systematic way to measure whether agent changes are real improvements. The financial agent example demonstrates implementation patterns for both predefined and simulated scenarios. Pricing for dataset management was not disclosed in the announcement.

Source: aws.amazon.com

