New Benchmark Exposes Code Agents' Architectural Blindness
A research paper released on arXiv introduces Theory of Code Space (ToCS), a benchmark that quantifies a fundamental gap in how AI code agents understand software architecture. While these agents perform well on isolated coding tasks, they consistently fail when navigating complex multi-file systems that require understanding how dozens of modules interact.
The Core Problem
Researchers hypothesize that code agent failures stem from an inability to construct, maintain, and update coherent mental models of how software components relate. This architectural understanding—knowledge of module dependencies, cross-cutting invariants, and design intent—remains largely inaccessible to current models.
How ToCS Works
The benchmark operates through three key mechanisms:
Procedural codebase generation: A generator creates medium-complexity Python projects containing four distinct types of module dependency edges, including syntactic imports and config-driven dynamic wiring, each reflecting a different real-world discovery method. Every codebase contains planted architectural constraints with verified ground truth.
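The paper's generator is not described in detail here, but the core idea can be sketched with hypothetical names: modules are nodes, typed dependency edges are planted at random, and the full edge list is retained as ground truth for scoring.

```python
import random

# Hypothetical sketch of procedural codebase generation. Only two of the four
# edge types are named in the article; the others are omitted rather than guessed.
EDGE_TYPES = ["syntactic_import", "config_dynamic_wiring"]

def generate_codebase(n_modules: int, n_edges: int, seed: int = 0):
    rng = random.Random(seed)
    modules = [f"module_{i}" for i in range(n_modules)]
    ground_truth = set()
    while len(ground_truth) < n_edges:
        src, dst = rng.sample(modules, 2)  # distinct source and target
        ground_truth.add((src, dst, rng.choice(EDGE_TYPES)))
    # In a real generator, each planted edge would be rendered as actual
    # Python source: an `import` line for syntactic edges, a config entry
    # for dynamic wiring, and so on.
    return modules, ground_truth

modules, truth = generate_codebase(n_modules=10, n_edges=15)
```

Keeping the planted edge set as verified ground truth is what lets the benchmark score discovery precisely, rather than relying on human judgment.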
Partial observability harness: Agents explore codebases under a limited budget, forcing them to make strategic choices about which modules to examine.
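A partial-observability harness of this kind can be sketched as follows (all names here are hypothetical, not the paper's API): the agent always sees file names, but must spend budget to read file contents, and the run ends when the budget is exhausted.

```python
# Hypothetical sketch of a limited-budget exploration harness.
class ExplorationHarness:
    def __init__(self, files: dict, budget: int):
        self._files = files              # path -> source text, hidden until opened
        self.budget = budget
        self.opened: dict = {}

    def listing(self) -> list:
        return sorted(self._files)       # file names are always visible

    def open(self, path: str) -> str:
        if self.budget <= 0:
            raise RuntimeError("exploration budget exhausted")
        self.budget -= 1
        self.opened[path] = self._files[path]
        return self._files[path]

harness = ExplorationHarness({"a.py": "import b", "b.py": "", "c.py": "import a"}, budget=2)
harness.open("a.py")
harness.open("c.py")
# A third open would raise: with budget < file count, the agent must choose.
```

Because the budget is smaller than the codebase, which files the agent opens becomes part of what is being measured, not just what it concludes from them.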
Structured belief probing: Agents must periodically externalize their understanding as structured JSON, creating a time-series of architectural comprehension. This approach reveals whether agents truly understand code structure or merely pattern-match surface-level syntax.
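The JSON schema below is an illustrative assumption, not the paper's actual format, but it shows the mechanism: at each probe step the agent serializes its currently believed dependency edges, producing a comparable snapshot over time.

```python
import json

# Hypothetical sketch of belief externalization: dump the agent's believed
# edges as structured JSON at a given probe step.
def externalize_belief(step: int, believed_edges: list) -> str:
    return json.dumps({
        "step": step,
        "edges": [{"src": s, "dst": d, "type": t} for s, d, t in believed_edges],
    })

snapshot = externalize_belief(3, [("a.py", "b.py", "syntactic_import")])
belief = json.loads(snapshot)
```

Scoring a sequence of such snapshots against ground truth is what turns "does the agent understand the architecture?" into a measurable time series.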
Experimental Results
Preliminary testing of four rule-based baselines and five frontier LLM agents from three providers reveals stark performance variation:
- F1 scores ranged from 0.129 to 0.646—a 5x gap separating weaker models from better performers
- LLM agents discovered semantic edge types that rule-based baselines missed entirely
- Weaker models scored below simple heuristics, indicating no clear advantage from model scale alone
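The F1 scores above can be computed by treating each (source, destination, type) triple as one item and comparing predicted edges against the planted ground truth. A minimal sketch, assuming this set-based formulation:

```python
# Edge-discovery F1 over (src, dst, type) triples: standard
# precision/recall harmonic mean against planted ground truth.
def edge_f1(predicted: set, truth: set) -> float:
    if not predicted or not truth:
        return 0.0
    tp = len(predicted & truth)
    precision = tp / len(predicted)
    recall = tp / len(truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = {("a", "b", "import"), ("b", "c", "import"), ("a", "c", "config")}
pred = {("a", "b", "import"), ("a", "c", "config"), ("c", "a", "import")}
score = edge_f1(pred, truth)  # tp=2, precision=2/3, recall=2/3, F1=2/3
```

On this metric, the reported 0.129-to-0.646 spread means the weakest agents recovered only a small fraction of the dependency graph that the strongest agents found.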
A critical finding: belief externalization itself—accurately serializing internal understanding into structured JSON—proved to be a non-trivial capability in its own right, one that confounds belief-probing benchmarks. Some agents may understand architecture better than their JSON outputs suggest.
New Evaluation Dimensions
The research introduces two novel evaluation concepts:
- Architectural Constraint Discovery: Code-specific metrics measuring whether agents identify design rules embedded in codebases
- Active-Passive Gap decomposition: Splitting spatial reasoning limitations into selection (choosing which modules to examine) and decision (correctly interpreting examined code) components
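The Active-Passive Gap decomposition can be illustrated with a small sketch (the function and field names are hypothetical): a missed ground-truth edge counts as a selection failure if the agent never opened the edge's source module, and as a decision failure if it opened the module but still missed the edge.

```python
# Hypothetical sketch of splitting missed edges into selection failures
# (never examined the source module) and decision failures (examined it
# but misread the code).
def decompose_misses(truth: set, predicted: set, opened: set) -> dict:
    missed = truth - predicted
    selection = {e for e in missed if e[0] not in opened}
    decision = missed - selection
    return {"selection_misses": selection, "decision_misses": decision}

truth = {("a", "b", "import"), ("c", "d", "import"), ("e", "f", "import")}
pred = {("a", "b", "import")}
opened = {"a", "c"}   # "e" was never opened; "c" was opened but its edge was missed
result = decompose_misses(truth, pred, opened)
```

This separation matters for diagnosis: selection failures point to poor exploration strategy under the budget, while decision failures point to poor comprehension of code the agent actually read.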
Implications
These results challenge assumptions about code agent capabilities. Current frontier models can identify individual dependencies but struggle to build coherent, queryable models of entire system architecture. This limitation directly impacts real-world use cases like refactoring, dependency analysis, and architectural migration—tasks requiring global understanding rather than local optimization.
The open-source toolkit is available at https://github.com/che-shr-cat/tocs, enabling community validation and extension of these findings.
What This Means
Code agents marketed for full-codebase reasoning should be stress-tested against benchmarks like ToCS before deployment on complex systems. The architecture understanding gap represents a genuine technical limitation, not merely a training data problem. Developers should expect these agents to struggle with cross-module refactoring, architectural decisions, and understanding how components interact at scale—tasks requiring the kind of coherent belief states the benchmark measures.