Code agents can evolve math problems into harder variants, study finds

A new study demonstrates that code agents can autonomously evolve existing math problems into more complex, solvable variations through systematic exploration. The multi-agent framework addresses a critical bottleneck in training advanced LLMs toward IMO-level mathematical reasoning by providing a scalable mechanism for synthesizing high-difficulty problems.

Researchers have demonstrated that code agents can autonomously generate more difficult mathematical problems from existing ones—a potentially significant solution to a training data bottleneck as large language models pursue IMO-level mathematical capabilities.

The study, titled "Code2Math," introduces a multi-agent framework that uses code execution as a scalable environment for mathematical experimentation. Rather than relying on human-curated problem sets, the system employs code agents to evolve existing problems into structurally distinct, more challenging variations while validating both solvability and increased difficulty.

The Problem

As LLMs advance toward IMO-level performance, the supply of challenging, high-quality mathematical problems for training and evaluation has become critically constrained. Manually generating such problems at scale is impractical, creating a ceiling for further capability improvements in mathematical reasoning.

The Approach

The researchers hypothesized that code agents—which have demonstrated sophisticated reasoning and coding skills—could serve as a mechanism for exploring mathematical problem space. Their framework operates through multi-agent collaboration to:

  • Take existing math problems as input
  • Autonomously generate variations that are more complex than originals
  • Validate that generated problems remain solvable
  • Confirm increased difficulty compared to source problems

The approach leverages code execution as both an experimentation environment and a validation mechanism. By using code to manipulate, test, and verify problems, the system achieves deterministic validation of mathematical correctness.
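To make the evolve-and-validate loop concrete, here is a minimal toy sketch of the idea. It is not the paper's implementation: the real system uses LLM-driven code agents, whereas here a simple mutation function stands in for the agent, a brute-force solver stands in for the verification environment, and solver step count is a crude stand-in for measured difficulty. All names (`solve`, `evolve`, `evolve_and_validate`) are hypothetical.

```python
import random

def solve(problem):
    """Brute-force solver; returns (answer, steps) or (None, steps).
    Steps taken serve as a crude difficulty proxy."""
    target, divisors = problem
    steps = 0
    for n in range(1, 10_000):
        steps += 1
        if n >= target and all(n % d == 0 for d in divisors):
            return n, steps
    return None, steps

def evolve(problem, rng):
    """Mutate toward a harder variant (toy stand-in for the code agent)."""
    target, divisors = problem
    if rng.random() < 0.5:
        return (target * 2, divisors)          # raise the lower bound
    return (target, divisors + [rng.choice([7, 11, 13])])  # add a constraint

def evolve_and_validate(seed, budget=50, rng=None):
    """Search within a test-time budget; keep only variants that are
    both solvable and measurably harder than the current best."""
    rng = rng or random.Random(0)
    _, base_steps = solve(seed)
    best = None
    for _ in range(budget):                    # test-time exploration budget
        candidate = evolve(best or seed, rng)
        answer, steps = solve(candidate)
        if answer is None:
            continue                           # reject unsolvable variants
        if steps > base_steps:                 # require a difficulty increase
            best, base_steps = candidate, steps
    return best

# Seed problem: "smallest multiple of 2 and 3 that is at least 10"
seed = (10, [2, 3])
harder = evolve_and_validate(seed)
```

Because every candidate is checked by actually running the solver, validation is deterministic: a variant survives only if it is demonstrably solvable and strictly harder than its predecessor, mirroring the paper's use of code execution as both experimentation environment and validator.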

Key Findings

Empirical results show that code agents, given a sufficient test-time exploration budget, can synthesize new problems that are:

  • Structurally distinct from originals (not trivial modifications)
  • More challenging than source problems (measurable difficulty increase)
  • Reliably solvable (validated through code execution)

The study provides the first empirical evidence that code-driven agents can viably synthesize high-difficulty mathematical reasoning problems at scale within computational environments.

What This Means

If validated at scale, this approach could break the training data bottleneck limiting mathematical LLM development. Rather than waiting for human problem setters, teams could automatically generate synthetic problem variations with verified difficulty and solvability. This has direct implications for:

Model training: More diverse, harder problems accelerate capability development toward IMO-level reasoning.

Benchmarking: Reduces the risk of performance saturation on existing benchmark sets by continuously generating new evaluation challenges.

Scalability: Shifts problem generation from human-constrained to computationally scalable processes.

The approach also suggests broader applications—code-driven agents might evolve problems in other domains requiring systematic reasoning and verifiable solutions. The researchers have released their code and data, enabling reproduction and extension of the methodology.