OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions
OpenAI is calling for the retirement of SWE-bench Verified, the widely used AI coding benchmark, claiming most tasks are flawed enough to reject correct solutions. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.
OpenAI has declared SWE-bench Verified—one of the most prominent benchmarks for measuring AI coding ability—fundamentally broken and unsuitable for continued use.
The company claims the benchmark has two critical failures: most tasks contain flaws that cause them to reject solutions that are actually correct, and leading AI models have likely encountered the benchmark data during training, making scores reflect memorization rather than real coding capability.
The Problem
SWE-bench Verified has been the standard metric for evaluating AI coding systems, with major labs competing to achieve higher scores. The benchmark consists of real GitHub issues paired with their fixes, designed to test whether models can solve actual software engineering problems.
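SWE-bench-style grading can be sketched roughly as follows. This is an illustrative toy, not the real harness: the function names and the exact pass/fail semantics are assumptions, though the core idea (a patch counts as a solution only if the previously failing tests now pass and nothing else breaks) matches how the benchmark is generally described.

```python
# Hypothetical sketch of a SWE-bench-style grading loop. Names are
# illustrative, not the real harness. A task is "resolved" only if the
# tests that failed before the patch now pass, and previously passing
# tests still pass (no regressions).

def grade(fail_to_pass, pass_to_pass, run_test):
    """run_test(test_id) -> True if that test passes after the patch."""
    resolved = all(run_test(t) for t in fail_to_pass)
    no_regressions = all(run_test(t) for t in pass_to_pass)
    return resolved and no_regressions

# Toy run: the patch fixes the target test and breaks nothing else.
results = {"test_bugfix": True, "test_existing": True}
print(grade(["test_bugfix"], ["test_existing"], results.get))  # True
```

The important property is that the grader is entirely mechanical: whatever the tests encode, right or wrong, becomes the definition of "correct."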
OpenAI's critique centers on two dimensions:
Task Quality: A substantial portion of benchmark tasks contain flaws in their grading tests, causing functionally correct code to be marked as wrong. This inflates apparent difficulty and makes performance comparisons between models unreliable.
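One common way a grading test can reject correct code, sketched below with invented names rather than an actual SWE-bench task: the test asserts an incidental detail of the reference solution, such as an exact error-message string, instead of the behavioral contract the issue actually asked for.

```python
# Illustrative only: how an over-specific grading test marks a correct
# fix as wrong. Suppose the task is "validate_age must reject negatives."

def validate_age(age):
    # A correct fix: rejects negatives, but words the error message
    # differently than the reference solution did.
    if age < 0:
        raise ValueError(f"age must be non-negative, got {age}")
    return age

def overly_strict_test():
    try:
        validate_age(-1)
    except ValueError as e:
        # Coupled to one exact wording, not to the required behavior.
        return str(e) == "age cannot be negative"
    return False

def behavior_test():
    try:
        validate_age(-1)
    except ValueError:
        return True  # Only the contract matters: negatives must raise.
    return False

print(overly_strict_test())  # False: correct code marked wrong
print(behavior_test())       # True: the fix satisfies the actual issue
```

A model that solves the issue correctly but diverges from the reference implementation in any such incidental detail loses the point, which is exactly the failure mode OpenAI describes.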
Data Contamination: Leading models have likely encountered SWE-bench tasks or similar data during their training phases. This converts what should measure problem-solving ability into a measure of how well models retained training data.
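A deliberately simplistic sketch of what a contamination probe looks like, to make the concern concrete: flag a benchmark task if a long word n-gram from its text appears verbatim in the training corpus. Real contamination auditing is far harder (paraphrases, partial overlap, closed training sets), and nothing here reflects any lab's actual methodology.

```python
# Toy contamination probe (illustrative only): a benchmark task is
# suspect if a long n-gram from its text appears verbatim in the
# training corpus. Real audits must also catch paraphrased overlap.

def ngrams(text, n):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(task_text, corpus_text, n=8):
    # Any shared 8-word run suggests the task text was in training data.
    return bool(ngrams(task_text, n) & ngrams(corpus_text, n))

task = "fix the off by one error in the pagination helper so the last page renders"
corpus = "scraped issue text: fix the off by one error in the pagination helper so the last page renders"
print(looks_contaminated(task, corpus))  # True
```

If a model has seen the task (and its merged fix) during training, a high score on that task tells you nothing about generalization, which is the substance of OpenAI's second objection.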
Broader Implications
The critique raises questions about how the AI field validates progress in code generation. If the most widely-cited benchmark is compromised, then claimed improvements across multiple models may not reflect genuine capability gains.
This announcement follows a pattern of OpenAI growing increasingly critical of static benchmarks as a measure of progress. The company has previously questioned whether traditional benchmarks capture real-world usefulness, particularly for reasoning and coding tasks where edge cases matter more than aggregate scores.
The issue also affects the entire competitive landscape. If models have been overfitted to SWE-bench Verified through training data exposure, then their rankings don't accurately represent which systems actually perform better on novel coding problems.
What This Means
OpenAI's call to retire SWE-bench Verified signals that the AI community needs better evaluation frameworks for coding tasks. Rather than relying on static benchmarks vulnerable to contamination and task-design errors, the field likely needs dynamic benchmarks, real-time problem sets, or evaluation methods that reduce data leakage during training.
For practitioners and researchers, this is a reminder that high benchmark scores don't guarantee real-world performance—particularly when the benchmark itself has known flaws and potential data contamination issues.