The benchmark you used to evaluate your agent was retired — 59% of its test cases were wrong
From Theory Delta | Methodology | Published 2026-03-29 | Updated 2026-04-18
Your team used SWE-bench scores to make a deployment decision. A vendor cited 74% on SWE-bench Verified in their product announcement. You are treating a model that scores 70%+ on SWE-bench as production-ready for software engineering tasks.
That benchmark was retired in February 2026. More than half of its test cases were wrong.
What you expect
SWE-bench Verified is the standard signal for coding agent production capability. A model scoring 70%+ is production-ready. Benchmark scores are comparable across entries on the same leaderboard. Single-run scores are reliable estimates of capability.
What actually happens
OpenAI retired SWE-bench Verified in February 2026 after auditing found 59.4% of test cases were flawed — incorrect gold patches, ambiguous task descriptions, or contaminated evaluation sets. Models at the top of the historical leaderboard were partially benefiting from flawed tasks. Published benchmark scores going back to 2024 cannot be treated as accurate measurements of model capability.
Gold patch failures confirmed on specific tasks: jqlang__jq-2681 and tokio-rs__tokio-4384 have incorrect reference solutions. Any model that attempts the correct solution on these tasks scores lower than a model that mimics the wrong gold patch, so a perfect score on these tasks is unreachable by construction.
The production gap is 6.8x. ACE-Bench measures real-world end-to-end task completion for the same models that SWE-bench scores. For Claude Opus 4.5: a SWE-bench score of 74.4% vs real-world end-to-end task completion of 11.0%. This is not a measurement artifact. Benchmark tasks are isolated, well-specified, and reversible. Production tasks are embedded in context, ambiguous, and consequential.
Meimandi et al. (arXiv:2506.02064) surveyed agent evaluation literature: 83% of papers use technical metrics only (no task-completion or user-outcome measures), and fewer than 25% of forecast benchmark returns are realized in production deployments.
The frontier scores on a broken benchmark. As of March 2026, the SWE-bench leaderboard reaches 79.2% (Sonar Foundation Agent + Claude 4.5 Opus; live-SWE-agent + Claude 4.5 Opus medium). These scores exist on a benchmark where 59.4% of test cases were flawed. The 74-79% band is densely contested:
| Score | Entry | Notes |
|---|---|---|
| 79.2% | Sonar Foundation Agent + Claude 4.5 Opus | frontier |
| 78.8% | TRAE + Doubao-Seed-Code | ByteDance, multi-model stack |
| 76.8% | EPAM AI/Run Developer Agent + Claude 4 Sonnet | commercial tool |
| 76.8% | Atlassian Rovo Dev | commercial tool, public benchmark |
| 75.6% | Warp | commercial IDE agent |
| 74.8% | Harness AI | commercial CI/CD vendor |
Commercial tool entries as a signal: Warp, Atlassian Rovo Dev, and Harness AI appearing on the public leaderboard indicates that production coding tools treat SWE-bench as a marketing surface. This increases the risk that scaffold choices are optimized for leaderboard performance rather than general-purpose coding tasks.
Scaffold-driven distribution shift. Agent evaluation performance depends on the scaffolding framework wrapping the model, not just the model. mini-SWE-agent has emerged as the dominant evaluation harness, appearing with at least five distinct model configurations — harness choice is now a material variable in published scores. A 5pp ranking difference may be attributable to scaffold choices, not capability. (arXiv:2603.23749)
Single-pass scores overstate production reliability by 20-30 percentage points
tau-bench provides Pass^k evaluation — running the same task k times and measuring the fraction that pass on all k trials. A model with 69% Pass^1 in the retail domain drops to approximately 46% at Pass^4. Single-pass benchmark scores are unsuitable for production deployment decisions.
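The Pass^k computation can be sketched as follows. This uses the standard unbiased combinatorial estimator for "all k trials pass"; the per-task trial data below is illustrative, not taken from tau-bench itself:

```python
from math import comb

def pass_at_k_all(num_trials: int, num_successes: int, k: int) -> float:
    """Unbiased estimate of Pass^k: the probability that k independent
    trials of the same task ALL succeed, given num_successes passes
    observed in num_trials runs."""
    if k > num_trials:
        raise ValueError("need at least k trials per task to estimate Pass^k")
    # C(c, k) / C(n, k): the fraction of k-subsets of trials that are all passes.
    # math.comb returns 0 when c < k, so tasks with too few passes score 0.
    return comb(num_successes, k) / comb(num_trials, k)

# Per-task trial outcomes (True = pass) for a hypothetical 4-run evaluation.
trials = {
    "task-01": [True, True, True, True],
    "task-02": [True, True, False, True],
    "task-03": [True, False, False, True],
}

def benchmark_pass_k(trials: dict, k: int) -> float:
    """Average Pass^k over all tasks in the run."""
    scores = [pass_at_k_all(len(t), sum(t), k) for t in trials.values()]
    return sum(scores) / len(scores)

print(round(benchmark_pass_k(trials, 1), 3))  # 0.75
print(round(benchmark_pass_k(trials, 4), 3))  # 0.333
```

Note the same shape as the tau-bench numbers: flaky tasks that pass most single runs contribute heavily to Pass^1 but collapse to zero at Pass^4.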
Your tool selection accuracy collapses at production-scale tool counts
Tool selection accuracy degrades sharply with tool count:
| Tool count | Selection accuracy |
|---|---|
| Small tool set (baseline) | 43% |
| 100 tools | 14% |
Reported degradation ranges from 13.9% to 85% depending on tool set and context. Benchmarks that test agents with 5-10 tools do not predict performance in production agents with 50-100 tools. No current standard benchmark tests tool selection at production-scale tool counts.
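Since no standard benchmark covers this, profiling it in-house is the practical option. A minimal sketch, assuming your agent exposes some `select_tool(query, tools)` routing function (a hypothetical signature to adapt to your stack):

```python
import random

def profile_tool_selection(select_tool, eval_cases, distractor_tools,
                           tool_counts, seed=0):
    """Measure tool-selection accuracy at several tool-set sizes.

    select_tool(query, tools) -> chosen tool name (your agent's router).
    eval_cases: list of (query, correct_tool_name) pairs.
    distractor_tools: pool of irrelevant tool names to pad the set with.
    tool_counts: sizes to test, e.g. [10, 50, 100].
    """
    rng = random.Random(seed)
    results = {}
    for n in tool_counts:
        correct = 0
        for query, gold in eval_cases:
            # Build a tool set of size n: the gold tool plus n-1 distractors.
            distractors = rng.sample(distractor_tools, n - 1)
            tools = [gold] + distractors
            rng.shuffle(tools)
            if select_tool(query, tools) == gold:
                correct += 1
        results[n] = correct / len(eval_cases)
    return results
```

Run it at your actual production tool count rather than the 5-10 tools typical of benchmarks; the gap between the small-set and full-set accuracies is the number your leaderboard scores are hiding.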
Ragas faithfulness silent false positive
Ragas faithfulness metric issue #2248: the metric returns a perfect 1.0 score when retrieval context is empty. An evaluation pipeline using Ragas faithfulness will report perfect scores for a retrieval system that retrieves nothing. This affects any RAG evaluation pipeline that does not validate retrieval context non-emptiness before scoring.
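Until the issue is fixed upstream, the workaround is a pipeline-level guard. A minimal sketch; the `retrieved_contexts` field name and the `score_fn` interface are assumptions to adapt to your pipeline's schema, not Ragas API calls:

```python
def guard_faithfulness(samples, score_fn):
    """Apply a faithfulness scorer only to samples with non-empty retrieval
    context; flag the rest instead of letting them score a spurious 1.0.

    samples: list of dicts with a "retrieved_contexts" key (list of strings).
    score_fn: your faithfulness metric (e.g. a wrapper around a Ragas scorer).
    Returns (scored, flagged): scored is a list of (sample, score) pairs,
    flagged is the list of samples whose retrieval came back empty.
    """
    scored, flagged = [], []
    for s in samples:
        contexts = s.get("retrieved_contexts") or []
        if any(c.strip() for c in contexts):
            scored.append((s, score_fn(s)))
        else:
            # Empty retrieval: record an explicit failure, never a silent 1.0.
            flagged.append(s)
    return scored, flagged
```

Treat every flagged sample as a retrieval failure in your metrics, not as a missing data point.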
SCAM benchmark: bimodal improvement obscured by averages
The 1Password SCAM benchmark’s headline ~40pp average improvement with skill injection is misleading. The distribution is bimodal:
| Model tier | Skill improvement |
|---|---|
| Already-strong models (GPT-4o, Claude 3.7) | +6 to +24 pp |
| Weaker models (GPT-4o-mini, Gemini Flash) | +49 to +60 pp |
The pattern generalizes: any benchmark reporting a single average improvement across model tiers likely masks a bimodal distribution. Embedded credentials defeat all 8 tested models even with SCAM active — the benchmark does not flag this as a distinct failure class.
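A worked version of the averaging failure, with illustrative per-model numbers drawn from the tier ranges above (the model names are placeholders, not the benchmark's actual entries):

```python
# Hypothetical per-model skill-injection improvements in percentage points,
# one value from each end of the two tier ranges reported above.
improvements = {
    "strong-model-a": 6.0,
    "strong-model-b": 24.0,
    "weak-model-a": 49.0,
    "weak-model-b": 60.0,
}

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

pooled = mean(improvements.values())                      # the headline number
strong = mean([improvements["strong-model-a"], improvements["strong-model-b"]])
weak = mean([improvements["weak-model-a"], improvements["weak-model-b"]])

print(pooled)  # 34.75 -- close to the ~40pp headline
print(strong)  # 15.0  -- what a strong model actually gains
print(weak)    # 54.5  -- what a weak model actually gains
```

The pooled mean describes no model in the set: report per-tier numbers, or at minimum check whether the improvement distribution is unimodal before citing an average.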
What this means for you
If you used SWE-bench scores in a deployment decision: You made that decision against a benchmark where 59.4% of test cases were flawed. The scores you compared are measuring performance on broken tasks, not production capability. The ACE-Bench gap — 74.4% SWE-bench, 11.0% real-world completion — is the empirical translation of that benchmark score into actual capability.
If a vendor cited SWE-bench Verified in a recent announcement: The benchmark was retired in February 2026. A citation dated after the retirement points at a dead benchmark; one dated before it carries the systematic bias of the flawed test cases.
If you have 50+ tools in your production agent: The 43% → 14% accuracy drop at 100 tools is not reflected in any public leaderboard. Your production agent is almost certainly underperforming its benchmark score by a margin that no current benchmark captures.
If you are using Ragas faithfulness in your RAG evaluation pipeline: Your pipeline may be reporting perfect scores for a retrieval system that retrieves nothing. Validate that retrieval context is non-empty before applying the faithfulness metric on every run.
If you are comparing multi-model ensemble benchmark entries to single-model entries: TRAE uses four models, Refact.ai combines two models — their SWE-bench scores encode scaffold choices, not model capability. Scores are not portable across evaluation frameworks.
What to do
Stop citing SWE-bench as production-capability evidence. The benchmark is retired. SWE-bench Pro (Scale AI) is the OpenAI-recommended replacement; top agents currently score 55-59% there — substantially lower than the SWE-bench Verified numbers they replaced.
Require Pass^k at k >= 3 for any production deployment evaluation. A single-run score is an upper bound, not a central estimate. For long-horizon tasks, require k >= 5.
Profile tool selection accuracy at your actual tool count. If your production agent uses 50+ tools, benchmark performance at that count specifically. The 43% → 14% drop at 100 tools is not in any public leaderboard.
For RAG evaluation: Validate that retrieval context is non-empty before applying Ragas faithfulness. Issue #2248 remains open — the check must be added at the pipeline level.
When comparing agent benchmark scores across papers: Identify the scaffold (evaluation harness, tool-call protocol, retry logic) used during evaluation. Scores are not portable across scaffolds. mini-SWE-agent is the dominant harness — entries using different harnesses are not directly comparable.
Replace SWE-bench references in internal decision-making with ACE-Bench or task-specific benchmarks with independent test case validation and contamination controls.
Evidence
| Tool | Version | Result |
|---|---|---|
| SWE-bench Verified | retired Feb 2026 | independently-confirmed: 59.4% flawed test cases; retired by OpenAI |
| tau-bench | current (March 2026) | source-reviewed: Pass^1 69% → Pass^4 ~46% in retail domain |
| Ragas faithfulness | current (March 2026) | independently-confirmed: issue #2248 — returns 1.0 with empty context |
| 1Password SCAM | v1.0 | source-reviewed: bimodal distribution; embedded credentials universal failure |
Confidence: secondary-research — SWE-bench retirement independently confirmed by OpenAI’s official announcement. Ragas issue #2248 is an independent third-party report. arXiv:2506.02064 (Meimandi et al.) provides independent confirmation of the production gap pattern. ACE-Bench results are secondary-research (no direct execution of the benchmark in this investigation).
Falsification criterion: This claim would be disproved by a demonstration that SWE-bench Verified’s 59.4% flaw rate was a miscalculation and OpenAI reversed the retirement, OR by ACE-Bench measurements showing the production gap is less than 2x (not 6.8x) for the same models on comparable task sets.
Unverified: Claude Sonnet 5 “Fennec” (claude-sonnet-5@20260203) is absent from the SWE-bench Pro leaderboard as of March 5, 2026. No official Anthropic primary source confirms this model designation. Do not cite Sonnet 5 benchmark comparisons until a primary source appears.
Open questions: (1) What is the contamination rate on SWE-bench Pro? (2) Does tau-bench Pass^k decay generalize to domains other than retail and airline? (3) Is the Ragas faithfulness bug fixed in current releases?
Seen different? Contribute your evidence — theory delta is what makes this knowledge base work.