LLM News | TPS

research

Half of AI code passing SWE-bench would be rejected by real developers, METR study finds

A study by the research organization METR found that approximately 50% of AI-generated code solutions that pass the widely used SWE-bench benchmark would be rejected by the maintainers of the projects involved. The finding exposes a significant gap between industry-standard code generation benchmarks and real-world code review standards.

March 11, 2026 · 6:05 PM · 2 min read

ai-code-generation benchmarking swe-bench

via the-decoder.com