LLM News | TPS

benchmark

New benchmark reveals LLMs struggle with graduate-level math and computational reasoning

Researchers have released CompMath-MCQ, a new benchmark dataset containing 1,500 originally authored graduate-level mathematics questions designed to test LLM performance on advanced topics. The dataset covers linear algebra, numerical optimization, vector calculus, probability, and Python-based scientific computing, areas largely absent from existing math benchmarks. Baseline testing with state-of-the-art LLMs indicates that advanced computational mathematical reasoning remains a significant challenge for these models.

March 5, 2026 · 5:55 AM · 2 min read

benchmark mathematics LLM-evaluation

via arxiv.org ↗

research

Code agents can evolve math problems into harder variants, study finds

A new study demonstrates that code agents can autonomously evolve existing math problems into more complex, solvable variations through systematic exploration. The multi-agent framework addresses a critical bottleneck in training advanced LLMs toward IMO-level mathematical reasoning by providing a scalable mechanism for synthesizing high-difficulty problems.

March 5, 2026 · 1:38 AM · 2 min read

research code-agents mathematics

via arxiv.org ↗