
AMA-Bench reveals major gaps in LLM agent memory systems with real-world evaluation

Researchers introduce AMA-Bench, a benchmark for evaluating long-horizon memory in LLM-based autonomous agents using real-world trajectories and synthetic scaling. Existing memory systems underperform due to a lack of causal information and reliance on lossy similarity-based retrieval. The proposed AMA-Agent system, which combines causality graphs with tool-augmented retrieval, achieves 57.22% average accuracy, outperforming the strongest baseline by 11.16 percentage points.


AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

A new benchmark released on arXiv exposes a critical gap between how LLM agent memory is currently evaluated and how it actually performs in production agentic applications.

The problem is straightforward: existing benchmarks for LLM memory focus primarily on dialogue-centric, human-agent interactions. Real autonomous agents, however, operate in continuous streams of agent-environment interactions populated almost entirely by machine-generated representations—a fundamentally different evaluation scenario.

What AMA-Bench Measures

AMA-Bench (Agent Memory with Any length) introduces two evaluation components:

  1. Real-world agentic trajectories across representative agent applications, paired with expert-curated question-answer pairs
  2. Synthetic agentic trajectories that scale to arbitrary horizons with rule-based QA
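The synthetic component can be pictured with a short sketch. This is an illustrative assumption about how rule-based QA over arbitrary-length trajectories might work, not the paper's actual generator; the event schema and question template are hypothetical:

```python
# Illustrative sketch (assumed design, not AMA-Bench's generator):
# build a synthetic agent trajectory of any length, then derive QA
# pairs mechanically from the log so answers are checkable by rule.
import random

def make_trajectory(n_steps, seed=0):
    """Generate n_steps of machine-generated tool-call events."""
    rng = random.Random(seed)
    steps = []
    for i in range(n_steps):
        tool = rng.choice(["search", "read_file", "run_query"])
        steps.append({"step": i, "tool": tool, "result": f"result_{i}"})
    return steps

def rule_based_qa(steps):
    """Each answer follows deterministically from the trajectory."""
    return [
        (f"What did the {s['tool']} call at step {s['step']} return?",
         s["result"])
        for s in steps
    ]

traj = make_trajectory(1000)   # horizon scales to arbitrary length
qa_pairs = rule_based_qa(traj)
```

Because the answers are computed from the log itself, the horizon can be scaled to any length without new expert annotation, which is the property real-world trajectories alone cannot provide.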

This dual approach allows researchers to evaluate memory systems on both authentic agent behavior and controlled scaling scenarios—something existing benchmarks cannot provide.

The Memory Problem

The research reveals why current memory systems fail on realistic agent tasks. The three core limitations are:

  • Lack of causality information: Memory systems don't capture causal relationships between agent actions and outcomes
  • Missing objective context: Systems strip away critical factual information needed to reason about past interactions
  • Lossy retrieval methods: Similarity-based retrieval, dominant in current approaches, loses information during compression

These weaknesses compound as agent horizon length increases, making long-running autonomous applications unreliable.
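The lossy-retrieval failure mode is easy to demonstrate in miniature. The sketch below (my own toy example, using bag-of-words cosine similarity as a stand-in for embedding retrieval) shows a "why" query retrieving the symptom rather than the cause, because the causal step shares almost no surface vocabulary with the question:

```python
# Toy illustration of lossy similarity-based retrieval: the step that
# *caused* the failure shares few words with the query, so top-1
# retrieval by lexical similarity surfaces the symptom, not the cause.
from collections import Counter
import math

def cosine(a, b):
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

memory = [
    "agent deleted the staging config file",         # the actual cause
    "deploy job failed with missing config error",   # the symptom
    "agent sent a notification to the oncall channel",
]
query = "why did the deploy job fail"

top1 = max(memory, key=lambda m: cosine(query, m))
# top1 is the failure log entry, not the causal deletion step.
```

A real embedding model narrows this gap but does not close it: similarity scoring has no notion that one event produced another, which is exactly the causal information the benchmark shows is missing.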

AMA-Agent Solution

The researchers propose AMA-Agent, a memory architecture addressing these limitations:

  • Causality graphs that explicitly encode causal relationships between agent actions
  • Tool-augmented retrieval that supplements similarity matching with structured tool interactions
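One way to picture these two ideas together is the sketch below. This is an assumed design for illustration only, not AMA-Agent's implementation: a graph that stores explicit cause edges between events, queried through a tool-style lookup instead of similarity search. The class and method names are hypothetical:

```python
# Illustrative sketch (assumed design, not AMA-Agent's code): a
# causality graph stores action -> outcome edges, so "why" questions
# are answered by graph traversal rather than lossy similarity search.
class CausalMemory:
    def __init__(self):
        self.events = {}      # event_id -> description
        self.caused_by = {}   # event_id -> ids of causing events

    def record(self, event_id, description, causes=()):
        """Store an event and its explicit causal links."""
        self.events[event_id] = description
        self.caused_by[event_id] = list(causes)

    def why(self, event_id):
        """Tool-style lookup: walk causal edges back to root causes."""
        chain, frontier = [], [event_id]
        while frontier:
            eid = frontier.pop()
            chain.append(self.events[eid])
            frontier.extend(self.caused_by.get(eid, []))
        return chain

mem = CausalMemory()
mem.record("e1", "agent deleted staging config")
mem.record("e2", "deploy job failed", causes=["e1"])
mem.why("e2")  # traverses e2 -> e1, recovering the root cause
```

The point of the traversal is that it is exact: nothing is compressed away, so the answer quality does not degrade as the trajectory grows, unlike top-k similarity retrieval over an ever-larger log.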

On AMA-Bench evaluation, AMA-Agent achieves 57.22% average accuracy, surpassing the strongest baseline memory systems by 11.16 percentage points.

What This Means

This work exposes a crucial evaluation gap that has likely masked real-world agent reliability problems. Production agents may be relying on memory systems that perform well on dialogue benchmarks but fail on the continuous, machine-generated interaction streams they actually encounter. The 11-point improvement from causal reasoning suggests that explicit causal structure, rather than just more parameters or better embeddings, is fundamental to agent memory. For researchers building agentic applications, AMA-Bench provides the first realistic evaluation standard for memory systems. For deployed systems, it implies that rearchitecting memory around causality and structured retrieval may be necessary for reliable long-horizon performance.
