OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions
OpenAI is calling for the retirement of SWE-bench Verified, the widely-used AI coding benchmark, claiming most tasks are flawed enough to reject correct solutions. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.
OpenAI has declared SWE-bench Verified—one of the most prominent benchmarks for measuring AI coding ability—fundamentally broken and unsuitable for continued use.
The company claims the benchmark has two critical failures: most tasks contain flaws that cause them to reject solutions that are actually correct, and leading AI models have likely encountered the benchmark data during training, making scores reflect memorization rather than real coding capability.
The Problem
SWE-bench Verified has been the standard metric for evaluating AI coding systems, with major labs competing to achieve higher scores. The benchmark consists of real GitHub issues paired with their fixes, designed to test whether models can solve actual software engineering problems.
OpenAI's critique centers on two dimensions:
Task Quality: By OpenAI's account, most benchmark tasks contain flaws in how they validate solutions. Correct code gets marked as wrong, which depresses scores, makes tasks look harder than they are, and undermines performance comparisons between models.
Data Contamination: Leading models have likely encountered SWE-bench tasks or similar data during their training phases. This converts what should measure problem-solving ability into a measure of how well models retained training data.
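To make the task-quality failure concrete, here is a minimal, hypothetical sketch of how a validation test can reject a correct solution. It is not taken from SWE-bench Verified or from OpenAI's analysis. The scenario (an issue asking only that duplicates be removed) and all function names are invented for illustration.

```python
def dedupe_reference(items):
    """Hypothetical reference fix: removes duplicates, keeping first-seen order."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def dedupe_candidate(items):
    """Hypothetical alternative fix: also removes duplicates, but returns sorted output."""
    return sorted(set(items))


def overly_strict_test(fn):
    # Pins the exact ordering produced by the reference fix, which the
    # (assumed) issue text never asked for.
    return fn([3, 1, 3, 2, 1]) == [3, 1, 2]


def behavioral_test(fn):
    # Checks only what the assumed issue requires: same elements, no duplicates.
    result = fn([3, 1, 3, 2, 1])
    return len(result) == len(set(result)) and set(result) == {1, 2, 3}
```

Under these assumptions, both functions satisfy the stated requirement, yet only the reference fix passes the strict test. A benchmark that scores with the strict test would count the alternative fix as a failure.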
Broader Implications
The critique raises questions about how the AI field validates progress in code generation. If the most widely-cited benchmark is compromised, then claimed improvements across multiple models may not reflect genuine capability gains.
The announcement fits a pattern of OpenAI growing more critical of benchmarks as a measure of progress. The company has previously questioned whether traditional benchmarks capture real-world usefulness, particularly for reasoning and coding tasks where edge cases matter more than aggregate scores.
The issue also affects the entire competitive landscape. If models have been overfitted to SWE-bench Verified through training data exposure, then their rankings don't accurately represent which systems actually perform better on novel coding problems.
What This Means
OpenAI's call to retire SWE-bench Verified signals that the AI community needs better evaluation frameworks for coding tasks. Rather than relying on static benchmarks vulnerable to contamination and task-design errors, the field likely needs dynamic benchmarks, regularly refreshed problem sets, or evaluation methods that limit the leakage of test data into training.
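One simple way to screen for the kind of leakage described above is to measure n-gram overlap between a benchmark task and a sample of training text. The sketch below is an illustrative heuristic only, not the method OpenAI used. The function names, the whitespace tokenization, and the default window of five tokens are all assumptions.

```python
def ngrams(text, n):
    """Return the set of n-token windows in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text, corpus_text, n=5):
    """Fraction of the benchmark's n-grams that also appear in the corpus sample."""
    bench = ngrams(benchmark_text, n)
    if not bench:
        # Text shorter than n tokens yields no n-grams to compare.
        return 0.0
    return len(bench & ngrams(corpus_text, n)) / len(bench)
```

A ratio near 1.0 suggests the task text appears almost verbatim in the sampled corpus, while a ratio near 0.0 only shows that no exact overlap was found. Paraphrased or reformatted copies would slip past a check like this, so a low score does not rule out contamination.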
For practitioners and researchers, this is a reminder that high benchmark scores don't guarantee real-world performance—particularly when the benchmark itself has known flaws and potential data contamination issues.