benchmark

OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions

TL;DR

OpenAI is calling for the retirement of SWE-bench Verified, the widely used AI coding benchmark, claiming most tasks are flawed enough to reject correct solutions. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.

2 min read

OpenAI has declared SWE-bench Verified—one of the most prominent benchmarks for measuring AI coding ability—fundamentally broken and unsuitable for continued use.

The company claims the benchmark has two critical failures: most tasks contain flaws that cause them to reject solutions that are actually correct, and leading AI models have likely encountered the benchmark data during training, making scores reflect memorization rather than real coding capability.

The Problem

SWE-bench Verified has been the standard metric for evaluating AI coding systems, with major labs competing to achieve higher scores. The benchmark consists of real GitHub issues paired with their fixes, designed to test whether models can solve actual software engineering problems.
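SWE-bench-style evaluation grades a candidate patch by running the repository's test suite: the tests that failed before the fix (FAIL_TO_PASS) must now pass, and the tests that already passed (PASS_TO_PASS) must keep passing. A minimal sketch of that grading logic, with an illustrative function name and test-result format rather than the benchmark's actual harness:

```python
def grade_patch(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Return True if a candidate patch resolves the task.

    results maps each test name to whether it passed after the patch
    was applied; a test missing from results is treated as failed.
    """
    fixed = all(results.get(t, False) for t in fail_to_pass)
    no_regressions = all(results.get(t, False) for t in pass_to_pass)
    return fixed and no_regressions
```

Under this scheme, a flaw in the task itself, such as an environment-dependent or overly strict test in the PASS_TO_PASS list, makes the grader reject a patch that is actually correct, which is the failure mode OpenAI describes.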

OpenAI's critique centers on two dimensions:

Task Quality: A substantial portion of benchmark tasks contain errors in how they validate solutions. This means correct code gets marked as wrong, inflating difficulty metrics and making performance comparisons meaningless.

Data Contamination: Leading models have likely encountered SWE-bench tasks or similar data during their training phases. This converts what should measure problem-solving ability into a measure of how well models retained training data.
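One common heuristic for detecting this kind of contamination is to measure token n-gram overlap between a benchmark task and a training corpus: a high overlap suggests the model saw the task (or near-verbatim copies of it) during training. A minimal sketch, where the function and the default 8-gram window are illustrative assumptions, not OpenAI's actual methodology:

```python
def ngram_overlap(candidate: str, corpus: str, n: int = 8) -> float:
    """Fraction of the candidate's token n-grams that also appear in the corpus."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    cand = ngrams(candidate)
    if not cand:
        return 0.0  # candidate shorter than n tokens: nothing to compare
    return len(cand & ngrams(corpus)) / len(cand)
```

A benchmark task scoring near 1.0 against the training data would be a strong contamination signal; real contamination audits also have to handle paraphrases and code reformatting, which simple n-gram matching misses.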

Broader Implications

The critique raises questions about how the AI field validates progress in code generation. If the most widely cited benchmark is compromised, then claimed improvements across multiple models may not reflect genuine capability gains.

This announcement fits a pattern of OpenAI growing more critical of benchmarks as a measure of model capability. The company has previously questioned whether traditional benchmarks capture real-world usefulness, particularly for reasoning and coding tasks where edge cases matter more than aggregate scores.

The issue also affects the entire competitive landscape. If models have overfit to SWE-bench Verified through training-data exposure, then leaderboard rankings don't reflect which systems actually perform better on novel coding problems.

What This Means

OpenAI's call to retire SWE-bench Verified signals that the AI community needs better evaluation frameworks for coding tasks. Rather than relying on static benchmarks vulnerable to contamination and task-design errors, the field likely needs dynamic benchmarks, real-time problem sets, or evaluation methods that reduce data leakage during training.

For practitioners and researchers, this is a reminder that high benchmark scores don't guarantee real-world performance—particularly when the benchmark itself has known flaws and potential data contamination issues.

Related Articles

analysis

Altman criticizes Anthropic's restricted Mythos cybersecurity model as 'fear-based marketing'

OpenAI CEO Sam Altman criticized Anthropic's new cybersecurity model Mythos during a podcast appearance, calling the company's decision to restrict public access 'fear-based marketing.' Anthropic claims Mythos is too powerful to release publicly due to potential weaponization by cybercriminals.

model release

OpenAI Releases GPT-5.4 Image 2 with 272K Context Window and Image Generation

OpenAI has released GPT-5.4 Image 2, combining the GPT-5.4 reasoning model with image generation capabilities. The multimodal model features a 272K token context window and is priced at $8 per million input tokens and $15 per million output tokens.

model release

OpenAI releases ChatGPT Images 2.0 with 3840x2160 resolution at $30 per 1M output tokens

OpenAI released ChatGPT Images 2.0, pricing output tokens at $30 per million with maximum resolution of 3840x2160 pixels. CEO Sam Altman claims the improvement from gpt-image-1 to gpt-image-2 equals the jump from GPT-3 to GPT-5.

model release

OpenAI releases ChatGPT Images 2.0 with integrated reasoning and text-image composition

OpenAI has released ChatGPT Images 2.0, which integrates reasoning capabilities to generate complex visual compositions combining text and images. The model supports aspect ratios from 3:1 to 1:3 and outputs up to 2K resolution, with advanced features available to Plus, Pro, Business, and Enterprise users.
