
New benchmark reveals LLMs struggle with graduate-level math and computational reasoning

Researchers have released CompMath-MCQ, a benchmark dataset of 1,500 originally authored graduate-level mathematics questions designed to test LLM performance on advanced topics. The dataset covers linear algebra, numerical optimization, vector calculus, probability, and Python-based scientific computing, areas largely absent from existing math benchmarks. Baseline tests with state-of-the-art models suggest these topics remain difficult for current LLMs.


Researchers have introduced CompMath-MCQ, a new benchmark dataset for evaluating large language models on graduate-level mathematical reasoning. The benchmark consists of 1,500 multiple-choice questions—each with three options and exactly one correct answer—authored by professors teaching graduate-level courses.
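The stated format (three options, exactly one correct answer) maps naturally onto a simple record type. A minimal sketch of what one benchmark item might look like; the field names here are hypothetical, since the article does not show the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    # Hypothetical schema; the actual CompMath-MCQ field names
    # are not specified in the article.
    question: str
    options: list[str]   # exactly three answer choices
    answer_index: int    # index (0-2) of the single correct option

    def is_correct(self, choice: int) -> bool:
        return choice == self.answer_index

item = MCQItem(
    question="What is the rank of the 3x3 identity matrix?",
    options=["1", "2", "3"],
    answer_index=2,
)
print(item.is_correct(2))  # True
```

A fixed option count and a single keyed answer are what make fully automatic scoring possible later in the pipeline.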

What the Dataset Covers

CompMath-MCQ addresses a significant gap in existing LLM evaluation. While current benchmarks focus heavily on elementary mathematics, competition-style problems, or formal theorem proving, this dataset targets advanced computational mathematics rarely assessed at scale. Topics include:

  • Linear algebra
  • Numerical optimization
  • Vector calculus
  • Probability theory
  • Python-based scientific computing

All 1,500 questions are newly created specifically for this benchmark. Researchers explicitly avoided sourcing from existing materials to prevent data leakage—a common issue where models may have encountered training data from publicly available problem sets.

Validation Methodology

Question quality was verified through two mechanisms. First, researchers employed a cross-LLM disagreement procedure where inconsistent model answers signal potentially ambiguous or poorly constructed questions. Second, expert human reviewers manually validated all questions to ensure correctness and clarity.
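The cross-LLM disagreement check can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and the function and variable names are hypothetical: each surveyed model answers every question, and any question where the models split across options is flagged for human review.

```python
def flag_ambiguous(answers_by_model: dict[str, list[str]]) -> list[int]:
    """Return indices of questions where the surveyed models disagree.

    answers_by_model maps a model name to its list of chosen options
    (one entry per question). Disagreement on a question is treated as
    a signal that it may be ambiguous or poorly constructed.
    """
    per_question = zip(*answers_by_model.values())
    return [i for i, answers in enumerate(per_question)
            if len(set(answers)) > 1]

answers = {
    "model_a": ["A", "B", "C", "A"],
    "model_b": ["A", "C", "C", "A"],
    "model_c": ["A", "B", "C", "B"],
}
print(flag_ambiguous(answers))  # [1, 3]
```

Note that this procedure only surfaces candidates; per the article, expert reviewers still validated all 1,500 questions manually.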

The multiple-choice format enables objective, reproducible evaluation using the lm_eval library, removing subjective grading bias from assessments.
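The article names lm_eval (EleutherAI's lm-evaluation-harness) as the evaluation tool but gives no task configuration details. The objective scoring that the multiple-choice format enables reduces to exact-match accuracy over chosen options; a minimal sketch of that principle, with made-up data:

```python
def mcq_accuracy(predictions: list[int], gold: list[int]) -> float:
    """Exact-match accuracy: the fraction of questions where the
    model's chosen option index equals the keyed answer. Because no
    human grading is involved, rescoring the same outputs always
    yields the same number."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

print(mcq_accuracy([2, 0, 1, 2], [2, 1, 1, 2]))  # 0.75
```

In practice, lm_eval handles prompting, answer extraction, and aggregation, but the final metric is this kind of deterministic comparison rather than a judgment call.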

Baseline Results

According to the paper, baseline testing with state-of-the-art LLMs indicates that advanced computational mathematical reasoning remains a significant challenge. The abstract does not report specific performance scores, but the authors emphasize that current models leave meaningful room for improvement on these graduate-level problems.

What This Means

CompMath-MCQ fills a critical gap in LLM evaluation methodology. Most existing math benchmarks test relatively narrow skill sets—either basic algebra and arithmetic, or highly specialized formal reasoning. Graduate-level mathematics requires integration of multiple concepts, practical computational knowledge, and the ability to apply theoretical principles to real problems.

The creation of this benchmark suggests the research community is increasingly focused on assessing practical mathematical capability rather than performance on standardized test formats. For model developers, this offers a more rigorous evaluation framework that may better predict real-world utility in scientific computing, engineering, and research applications where graduate-level math is a requirement.

The dataset is freely available on GitHub, enabling reproducible benchmarking across the broader AI research community.