benchmark
AMA-Bench reveals major gaps in LLM agent memory systems with real-world evaluation
Researchers introduce AMA-Bench, a benchmark for evaluating long-horizon memory in LLM-based autonomous agents using real-world trajectories and synthetic scaling. Existing memory systems underperform due to lack of causality and reliance on lossy similarity-based retrieval. The proposed AMA-Agent system with causality graphs and tool-augmented retrieval achieves 57.22% accuracy, outperforming baselines by 11.16 percentage points.