The benchmark you used to evaluate your agent was retired — 59% of its test cases were wrong

Published: 2026-03-29 | Last verified: 2026-04-18

From Theory Delta | Methodology | Published 2026-03-29 | Updated 2026-04-18

Your team used SWE-bench scores to make a deployment decision. A vendor cited 74% on SWE-bench Verified in their product announcement. You are treating a model that scores 70%+ on SWE-bench as production-ready for software engineering tasks.

That benchmark was retired in February 2026. More than half of its test cases were wrong.

What you expect

SWE-bench Verified is the standard signal for coding agent production capability. A model scoring 70%+ is production-ready. Benchmark scores are comparable across entries on the same leaderboard. Single-run scores are reliable estimates of capability.

What actually happens

OpenAI retired SWE-bench Verified in February 2026 after auditing found 59.4% of test cases were flawed — incorrect gold patches, ambiguous task descriptions, or contaminated evaluation sets. Models at the top of the historical leaderboard were partially benefiting from flawed tasks. Published benchmark scores going back to 2024 cannot be treated as accurate measurements of model capability.

Gold-patch failures are confirmed on specific tasks: jqlang__jq-2681 and tokio-rs__tokio-4384 ship incorrect reference solutions. A model that attempts the correct fix on these tasks scores lower than one that mimics the wrong gold patch, so a perfect score on them is unreachable.

The production gap is 6.8x. ACE-Bench compares real-world task completion with SWE-bench scores for the same models. For Claude Opus 4.5: a SWE-bench score of 74.4% versus real-world end-to-end task completion of 11.0%. This is not a measurement artifact: benchmark tasks are isolated, well-specified, and reversible, while production tasks are embedded in context, ambiguous, and consequential.
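The 6.8x figure follows directly from the two published numbers; a quick arithmetic check:

```python
swe_bench = 74.4   # SWE-bench Verified score (%)
real_world = 11.0  # ACE-Bench end-to-end task completion (%)

gap = swe_bench / real_world
print(f"production gap: {gap:.1f}x")  # -> production gap: 6.8x
```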

Meimandi et al. (arXiv:2506.02064) surveyed agent evaluation literature: 83% of papers use technical metrics only (no task-completion or user-outcome measures), and fewer than 25% of forecast benchmark returns are realized in production deployments.

The frontier scores on a broken benchmark. As of March 2026, the SWE-bench leaderboard reaches 79.2% (Sonar Foundation Agent + Claude 4.5 Opus; live-SWE-agent + Claude 4.5 Opus medium). These scores exist on a benchmark where 59.4% of test cases were flawed. The 74-79% band is densely contested:

| Score | Entry | Notes |
| --- | --- | --- |
| 79.2% | Sonar Foundation Agent + Claude 4.5 Opus | frontier |
| 78.8% | TRAE + Doubao-Seed-Code | ByteDance, multi-model stack |
| 76.8% | EPAM AI/Run Developer Agent + Claude 4 Sonnet | commercial tool |
| 76.8% | Atlassian Rovo Dev | commercial tool, public benchmark |
| 75.6% | Warp | commercial IDE agent |
| 74.8% | Harness AI | commercial CI/CD vendor |

Commercial tool entries as a signal: Warp, Atlassian Rovo Dev, and Harness AI appearing on the public leaderboard indicates production coding tools are treating SWE-bench as a marketing surface. This increases the risk that scaffold choices are optimized for leaderboard performance rather than general-purpose coding tasks.

Scaffold-driven distribution shift. Agent evaluation performance depends on the scaffolding framework wrapping the model, not just the model. mini-SWE-agent has emerged as the dominant evaluation harness, appearing with at least five distinct model configurations — harness choice is now a material variable in published scores. A 5pp ranking difference may be attributable to scaffold choices, not capability. (arXiv:2603.23749)

Single-pass scores overstate production reliability by 20-30 percentage points

tau-bench provides Pass^k evaluation — running the same task k times and measuring the fraction that pass on all k trials. A model with 69% Pass^1 in the retail domain drops to approximately 46% at Pass^4. Single-pass benchmark scores are unsuitable for production deployment decisions.
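A minimal sketch of the Pass^k computation, assuming outcomes are stored as a task × trial boolean matrix (this layout is illustrative, not tau-bench's actual harness API; it also uses the first k trials rather than sampling, which is a simplification):

```python
def pass_hat_k(trial_results: list[list[bool]], k: int) -> float:
    """Fraction of tasks that succeed on all k trials (Pass^k).

    trial_results[i][j] is True if task i passed on trial j.
    Requires at least k trials per task.
    """
    assert all(len(t) >= k for t in trial_results)
    return sum(all(t[:k]) for t in trial_results) / len(trial_results)

# Toy data: four tasks, four trials each.
results = [
    [True, True, True, True],     # always passes
    [True, False, True, True],    # flaky
    [True, True, False, False],   # flaky
    [False, False, False, False], # always fails
]
print(pass_hat_k(results, 1))  # 0.75 -- three of four tasks pass trial 1
print(pass_hat_k(results, 4))  # 0.25 -- only one task passes all four
```

Note how a 75% single-pass number collapses to 25% under Pass^4 once flaky tasks are counted honestly; that is the mechanism behind the 69% → ~46% drop cited above.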

Your tool selection accuracy collapses at production-scale tool counts

Tool selection accuracy degrades sharply with tool count:

| Tool count | Selection accuracy |
| --- | --- |
| Small tool set (baseline) | 43% |
| 100 tools | 14% |
| Total degradation range | 13.9% to 85%, depending on context |

Benchmarks that test agents with 5-10 tools do not predict performance in production agents with 50-100 tools. No current standard benchmark tests tool selection at production-scale tool counts.
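One way to profile this yourself is to hide each gold tool among a growing number of distractors drawn from your real tool registry. A minimal harness sketch, assuming your agent's router is callable as `select_tool(query, tools)` (that interface is an assumption, not a standard API):

```python
import random

def tool_selection_accuracy(select_tool, cases, tool_pool, n_tools, seed=0):
    """Accuracy of select_tool when the gold tool is hidden among
    n_tools - 1 distractors sampled from tool_pool.

    select_tool(query, tools) -> chosen tool name (your agent's router).
    cases: list of (query, gold_tool_name) pairs.
    """
    rng = random.Random(seed)
    hits = 0
    for query, gold in cases:
        distractors = rng.sample([t for t in tool_pool if t != gold], n_tools - 1)
        tools = distractors + [gold]
        rng.shuffle(tools)  # don't let position leak the answer
        hits += select_tool(query, tools) == gold
    return hits / len(cases)
```

Run it at 10, 50, and 100 tools against your production registry; the informative result is the slope of the curve, not any single point.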

Ragas faithfulness silent false positive

Ragas faithfulness metric issue #2248: the metric returns a perfect 1.0 score when retrieval context is empty. An evaluation pipeline using Ragas faithfulness will report perfect scores for a retrieval system that retrieves nothing. This affects any RAG evaluation pipeline that does not validate retrieval context non-emptiness before scoring.
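A pipeline-level guard can close the hole, sketched here against a generic scorer interface (`score_fn` is a placeholder for however your pipeline invokes the metric; the real Ragas API differs):

```python
import math

def guarded_faithfulness(score_fn, answer: str, contexts: list[str]) -> float:
    """Refuse to score faithfulness when retrieval returned nothing.

    Works around the issue #2248 failure mode, where an empty retrieval
    context yields a perfect 1.0 instead of an error.
    """
    if not contexts or all(not c.strip() for c in contexts):
        return math.nan  # NaN propagates loudly instead of a silent fake 1.0
    return score_fn(answer, contexts)
```

Returning NaN rather than raising is a design choice; the non-negotiable part is that empty retrieval must never surface as perfect faithfulness.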

SCAM benchmark: bimodal improvement obscured by averages

The 1Password SCAM benchmark’s headline ~40pp average improvement with skill injection is misleading. The distribution is bimodal:

| Model tier | Skill improvement |
| --- | --- |
| Already-strong models (GPT-4o, Claude 3.7) | +6 to +24 pp |
| Weaker models (GPT-4o-mini, Gemini Flash) | +49 to +60 pp |

The pattern generalizes: any benchmark reporting a single average improvement across model tiers likely masks a bimodal distribution. Embedded credentials defeat all 8 tested models even with SCAM active — the benchmark does not flag this as a distinct failure class.
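The masking effect is plain arithmetic. Illustrative numbers below use only the range endpoints from the table above; the per-model values are made up for the sketch:

```python
# Hypothetical per-model skill-injection gains (pp), one per tier endpoint.
gains = {
    "strong-a": 6, "strong-b": 24,  # already-strong tier
    "weak-a": 49, "weak-b": 60,     # weaker tier
}

overall = sum(gains.values()) / len(gains)
strong = (gains["strong-a"] + gains["strong-b"]) / 2
weak = (gains["weak-a"] + gains["weak-b"]) / 2

print(f"headline average: +{overall:.2f} pp")                    # one number
print(f"per tier: strong +{strong:.1f} pp, weak +{weak:.1f} pp") # the real story
```

A single headline mean sits between the two tiers and describes neither; always ask for the per-tier breakdown.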

What this means for you

If you used SWE-bench scores in a deployment decision: You made that decision against a benchmark where 59.4% of test cases were flawed. The scores you compared are measuring performance on broken tasks, not production capability. The ACE-Bench gap — 74.4% SWE-bench, 11.0% real-world completion — is the empirical translation of that benchmark score into actual capability.

If a vendor cited SWE-bench Verified in a recent announcement: The benchmark was retired in February 2026. A citation dated after the retirement points at a dead benchmark; one dated before it carries the systematic bias of the flawed test cases.

If you have 50+ tools in your production agent: The 43% → 14% accuracy drop at 100 tools is not reflected in any public leaderboard. Your production agent is almost certainly underperforming its benchmark score by a margin that no current benchmark captures.

If you are using Ragas faithfulness in your RAG evaluation pipeline: Your pipeline may be reporting perfect scores for a retrieval system that retrieves nothing. Validate on every run that the retrieval context is non-empty before applying the faithfulness metric.

If you are comparing multi-model ensemble benchmark entries to single-model entries: TRAE uses four models, Refact.ai combines two models — their SWE-bench scores encode scaffold choices, not model capability. Scores are not portable across evaluation frameworks.

What to do

Stop citing SWE-bench as production-capability evidence. The benchmark is retired. SWE-bench Pro (Scale AI) is the OpenAI-recommended replacement; top agents currently score 55-59% there — substantially lower than the SWE-bench Verified numbers they replaced.

Require Pass^k at k >= 3 for any production deployment evaluation. A single-run score is an upper bound, not a central estimate. For long-horizon tasks, require k >= 5.

Profile tool selection accuracy at your actual tool count. If your production agent uses 50+ tools, benchmark performance at that count specifically. The 43% → 14% drop at 100 tools is not in any public leaderboard.

For RAG evaluation: Validate that retrieval context is non-empty before applying Ragas faithfulness. Issue #2248 remains open — the check must be added at the pipeline level.

When comparing agent benchmark scores across papers: Identify the scaffold (evaluation harness, tool-call protocol, retry logic) used during evaluation. Scores are not portable across scaffolds. mini-SWE-agent is the dominant harness — entries using different harnesses are not directly comparable.

Replace SWE-bench references in internal decision-making with ACE-Bench or task-specific benchmarks with independent test case validation and contamination controls.

Evidence

| Tool | Version | Result |
| --- | --- | --- |
| SWE-bench Verified | retired Feb 2026 | independently-confirmed: 59.4% flawed test cases; retired by OpenAI |
| tau-bench | current (March 2026) | source-reviewed: Pass^1 69% → Pass^4 ~46% in retail domain |
| Ragas faithfulness | current (March 2026) | independently-confirmed: issue #2248 — returns 1.0 with empty context |
| 1Password SCAM | v1.0 | source-reviewed: bimodal distribution; embedded credentials universal failure |

Confidence: secondary-research — SWE-bench retirement independently confirmed by OpenAI’s official announcement. Ragas issue #2248 is an independent third-party report. arXiv:2506.02064 (Meimandi et al.) provides independent confirmation of the production gap pattern. ACE-Bench results are secondary-research (no direct execution of the benchmark in this investigation).

Falsification criterion: This claim would be disproved by a demonstration that SWE-bench Verified’s 59.4% flaw rate was a miscalculation and OpenAI reversed the retirement, OR by ACE-Bench measurements showing the production gap is less than 2x (not 6.8x) for the same models on comparable task sets.

Unverified: Claude Sonnet 5 “Fennec” (claude-sonnet-5@20260203) is absent from the SWE-bench Pro leaderboard as of March 5, 2026. No official Anthropic primary source confirms this model designation. Do not cite Sonnet 5 benchmark comparisons until a primary source appears.

Open questions: (1) What is the contamination rate on SWE-bench Pro? (2) Does tau-bench Pass^k decay generalize to domains other than retail and airline? (3) Is the Ragas faithfulness bug fixed in current releases?

Seen different? Contribute your evidence — theory delta is what makes this knowledge base work.