RoboMME benchmark reveals memory architecture trade-offs in robotic vision-language models
Researchers introduce RoboMME, a large-scale standardized benchmark for evaluating memory in robotic vision-language-action (VLA) models across 16 manipulation tasks. The study tests 14 memory-augmented VLA variants and finds that no single memory architecture excels across all task types—each design offers distinct trade-offs depending on temporal, spatial, object, and procedural demands.
A new standardized benchmark for evaluating memory mechanisms in robotic vision-language-action (VLA) models reveals that memory architecture effectiveness is highly task-dependent, with no single design dominating across all scenarios.
The benchmark, called RoboMME, comprises 16 manipulation tasks designed to systematically test temporal, spatial, object, and procedural memory—critical capabilities for long-horizon, history-dependent robotic tasks like counting repeated actions or manipulating temporarily occluded objects.
Benchmark Structure and Scope
RoboMME addresses a longstanding gap in robotic AI evaluation. While recent VLA models have begun incorporating memory mechanisms to handle longer task horizons, their evaluations have remained confined to narrow, non-standardized settings, limiting systematic comparison and progress measurement across the field.
The benchmark's taxonomy covers four memory dimensions:
- Temporal memory: Tracking sequences of actions over time
- Spatial memory: Maintaining awareness of object positions and relationships
- Object memory: Recognizing and distinguishing between entities
- Procedural memory: Remembering task steps and execution order
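The four-dimension taxonomy above can be pictured as a simple task-to-dimension mapping. The sketch below is purely illustrative: the dimension names come from the article, but the task names and the mapping structure are invented for this example and are not RoboMME's actual schema.

```python
from enum import Enum

class MemoryDimension(Enum):
    """The four memory dimensions probed by RoboMME."""
    TEMPORAL = "temporal"      # tracking sequences of actions over time
    SPATIAL = "spatial"        # object positions and relationships
    OBJECT = "object"          # recognizing and distinguishing entities
    PROCEDURAL = "procedural"  # task steps and execution order

# Hypothetical tasks tagged by the dimension they stress most;
# RoboMME's real 16 tasks and their labels may differ.
EXAMPLE_TASKS = {
    "count_repeated_pick_and_place": MemoryDimension.TEMPORAL,
    "retrieve_occluded_object": MemoryDimension.SPATIAL,
    "match_previously_seen_cup": MemoryDimension.OBJECT,
    "resume_multi_step_recipe": MemoryDimension.PROCEDURAL,
}

def tasks_for(dim: MemoryDimension) -> list[str]:
    """Return example task names that stress a given memory dimension."""
    return [task for task, d in EXAMPLE_TASKS.items() if d is dim]
```

A taxonomy like this lets an evaluation harness report per-dimension scores rather than a single aggregate, which is what makes the benchmark's trade-off analysis possible.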
Experimental Methodology
Researchers developed 14 memory-augmented VLA variants, all built on the π0.5 backbone, to systematically explore different memory representations and integration strategies. Holding the backbone fixed enables direct comparison of architectural choices, rather than confounding those choices with differences between base models.
The study tests multiple memory mechanisms across varying integration approaches—examining how memory is stored, retrieved, and incorporated into the model's decision-making process.
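The three axes the study varies — how memory is stored, retrieved, and integrated into decision-making — can be framed as a minimal episodic-buffer interface. The sketch below is an illustrative abstraction under assumed design choices (FIFO storage, dot-product retrieval, mean-pooled integration); it is not the paper's implementation, and all names are invented.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One timestep's record: a step index plus an embedding.

    'embedding' stands in for whatever features the VLA backbone
    produces; here it is just a list of floats.
    """
    step: int
    embedding: list[float]

@dataclass
class EpisodicBuffer:
    """Illustrative memory module: fixed-capacity store + similarity retrieval."""
    capacity: int = 8
    _store: deque = field(default_factory=deque)

    def write(self, obs: Observation) -> None:
        # Storage policy (assumed): FIFO eviction at capacity.
        if len(self._store) == self.capacity:
            self._store.popleft()
        self._store.append(obs)

    def retrieve(self, query: list[float], k: int = 2) -> list[Observation]:
        # Retrieval policy (assumed): top-k by dot-product similarity.
        def score(o: Observation) -> float:
            return sum(q * e for q, e in zip(query, o.embedding))
        return sorted(self._store, key=score, reverse=True)[:k]

    def integrate(self, query: list[float]) -> list[float]:
        # Integration policy (assumed): average the retrieved embeddings
        # into one context vector a policy head could consume.
        hits = self.retrieve(query)
        if not hits:
            return query
        n = len(hits)
        return [sum(h.embedding[i] for h in hits) / n
                for i in range(len(query))]
```

Each of the three methods corresponds to one axis the study varies; swapping any one policy (e.g., learned retrieval instead of dot products, or cross-attention instead of mean pooling) yields a different point in the design space the 14 variants explore.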
Key Findings
Experimental results show that memory representation effectiveness varies significantly by task type. No single memory architecture achieves optimal performance across the benchmark:
- Different memory designs excel at different task dimensions
- Trade-offs exist between memory capacity, computational overhead, and task performance
- Task-specific characteristics determine which memory approach performs best
This finding suggests that task-aware memory design, rather than a universal memory mechanism, may be necessary for optimal robotic performance.
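If task-aware memory design is needed, one natural pattern is dispatch: select a memory mechanism based on a task's dominant memory demand. The sketch below is speculative and not something the paper proposes; both the strategy names and the mapping are hypothetical.

```python
# Purely illustrative dispatch table; the paper does not prescribe this mapping.
MEMORY_STRATEGY = {
    "temporal": "recurrent_state",    # e.g., hidden state carried across steps
    "spatial": "scene_graph",         # e.g., persistent object-position map
    "object": "instance_embeddings",  # e.g., per-entity feature bank
    "procedural": "step_checklist",   # e.g., explicit plan/progress tracker
}

def select_memory(task_demand: str) -> str:
    """Return a (hypothetical) memory strategy for a task's dominant demand."""
    try:
        return MEMORY_STRATEGY[task_demand]
    except KeyError:
        raise ValueError(f"unknown memory demand: {task_demand!r}")
```

A dispatcher like this is the simplest form of "task-aware" design; a more adaptive system might instead weight several memory modules per task rather than choosing exactly one.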
Implications for Robotic AI Development
RoboMME provides the field with standardized evaluation methodology for memory-augmented VLAs, enabling reproducible comparisons and accelerating development of more capable robotic systems. The benchmark's taxonomic structure allows researchers to identify which memory mechanisms address specific task challenges.
The publicly released code and video demonstrations at robomme.github.io enable other researchers to benchmark new memory approaches against the established baselines.
What This Means
The findings challenge assumptions that a single memory architecture can serve all robotic manipulation tasks. As VLA models scale to longer horizons and more complex behaviors, memory design becomes increasingly critical. RoboMME establishes infrastructure for systematic evaluation of these mechanisms, but also demonstrates that robotic systems may require adaptive or task-specific memory strategies rather than monolithic solutions. This work adds rigor to an area—robotic learning—where evaluation standards have historically been fragmented.