The best AI tested with DeepSWE

DataCurve has published DeepSWE, a new benchmark for AI coding agents that must carry out longer, more realistic software tasks. The benchmark is intended as an alternative to existing tests such as SWE-Bench Pro, which DataCurve considers too narrow, too contaminated by training data, and sometimes unreliable.

DeepSWE contains 113 original tasks across 91 open-source repositories and five programming languages: TypeScript, Go, Python, JavaScript, and Rust. The tasks were not copied from existing pull requests or commits, which lowers the chance that models have already seen the solutions during training.

The benchmark is more demanding than existing tests. The prompts are shorter than in SWE-Bench Pro, but the solutions require significantly more work on average: approximately 668 added lines of code and changes across seven files. This is intended to better reflect how developers actually use coding agents: they give short instructions, and the agent must figure out where the change belongs.

Verification is a key difference. DataCurve writes its own verifiers, which test behavior via public APIs and visible output rather than checking whether a solution closely resembles the reference patch. In an audit, DataCurve found that SWE-Bench Pro produced a relatively high number of incorrect verdicts, whereas DeepSWE's verdicts, according to its analysis, track true task correctness much more closely.
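
To illustrate the distinction, here is a minimal sketch that contrasts a patch-matching check with a behavioral check. It is not DataCurve's actual verifier: the `slugify` task, the two source snippets, the test cases, and the helper names `patch_match_verifier` and `behavioral_verifier` are all hypothetical, chosen only to show why a correct solution that is written differently from the reference can fail one kind of check and pass the other.

```python
# Hypothetical reference solution for an invented "slugify" task.
REFERENCE_SRC = r'''
def slugify(text):
    return "-".join(text.lower().split())
'''

# Hypothetical agent solution: different code, same observable behavior.
CANDIDATE_SRC = r'''
import re

def slugify(text):
    return re.sub(r"\s+", "-", text.strip().lower())
'''


def load(src):
    """Execute a source snippet and return its public `slugify` function."""
    namespace = {}
    exec(src, namespace)
    return namespace["slugify"]


def patch_match_verifier(candidate_src, reference_src):
    """Pass only if the candidate text is identical to the reference text."""
    return candidate_src.strip() == reference_src.strip()


def behavioral_verifier(func, cases):
    """Pass if the public function returns the expected output for every case."""
    return all(func(given) == expected for given, expected in cases)


# Illustrative input/output pairs (assumed, not taken from DeepSWE).
CASES = [
    ("Hello  World", "hello-world"),
    (" A b ", "a-b"),
    ("", ""),
]

if __name__ == "__main__":
    print("patch match:", patch_match_verifier(CANDIDATE_SRC, REFERENCE_SRC))
    print("behavioral:", behavioral_verifier(load(CANDIDATE_SRC), CASES))
```

In this sketch the candidate fails the patch-matching check but passes the behavioral one, which is the kind of incorrect verdict a behavior-based verifier is meant to avoid.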

On the leaderboard, GPT-5.5 scores highest at 70 percent. GPT-5.4 follows with 56 percent and Claude Opus 4.7 with 54 percent. After that, scores drop sharply: Claude Sonnet 4.6 achieves 32 percent and Gemini 3.5 Flash 28 percent. DataCurve states that DeepSWE therefore separates frontier coding agents more clearly than benchmarks on which models cluster closely together.

This matters for developers and companies because agent benchmarks increasingly influence model selection. DeepSWE shows that a model that scores well on short, familiar tasks is not automatically good at long-running repository work. The news also carries a warning: benchmark results remain dependent on task selection, harness, cost, and verification quality.