benchmarks | safety | New

SWE-bench

Benchmark evaluating LLMs on resolving real GitHub software issues.

weight 0.0 | Open Source | Launched 2026-06-07

💸 No earnings reported yet

What it is

SWE-bench is a benchmark that tests language models on resolving real-world software engineering issues pulled from GitHub repositories. It has become the de facto standard for measuring coding-agent capability.

How AI plugs in

SWE-bench evaluates models by having them resolve real GitHub issues end-to-end: given a repository and an issue description, the model generates a patch, and the patch is scored by whether it passes the repository's own tests.
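The scoring rule can be sketched in a few lines. This is a minimal illustration, not the official harness: it assumes each task lists the tests the fix must turn green (fail-to-pass) and the tests that must stay green (pass-to-pass), and that a test run has already been reduced to a name-to-pass/fail mapping. The function names is_resolved and resolve_rate and the test names are hypothetical.

```python
from typing import Dict, Iterable, List


def is_resolved(
    test_status: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True if the patched repo passes every required test.

    test_status maps a test name to True (passed) or False (failed).
    A required test that is missing from the results counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_status.get(name, False) for name in required)


def resolve_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical task: the patch fixes the bug but breaks an existing test.
status = {"test_bug_fixed": True, "test_existing_behavior": False}
print(is_resolved(status, ["test_bug_fixed"], ["test_existing_behavior"]))
```

Under this rule a patch that fixes the reported issue but breaks previously passing tests does not count as resolved, which is why the score reflects end-to-end correctness and not only whether the issue's symptom went away.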
