500 human-validated pairs of GitHub issues and pull requests from popular Python repositories, curated by the SWE-bench team. This subset of the original benchmark keeps only high-quality samples whose evaluation accuracy was confirmed through manual review. It serves as a rigorous test for autonomous systems attempting to solve real-world software engineering tasks.
Evaluation requires the unit test verification environment described in the SWE-bench documentation, which runs the tests that check for the post-PR behavior.
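As a rough illustration of what a post-PR behavior check amounts to, the sketch below decides whether a single sample counts as solved once its tests have been run. It assumes the two test lists carried by each SWE-bench sample (FAIL_TO_PASS, tests the reference pull request is expected to fix, and PASS_TO_PASS, tests that must keep passing). The function name `is_resolved`, the `test_status` mapping, and the "PASSED"/"FAILED" strings are hypothetical and are not part of the official harness, which also handles environment setup and test execution.

```python
def is_resolved(test_status, fail_to_pass, pass_to_pass):
    """Return True when every required test passed after applying a candidate patch.

    test_status  -- hypothetical mapping of test name to "PASSED" or "FAILED"
    fail_to_pass -- tests expected to go from failing to passing
    pass_to_pass -- tests that must still pass after the patch
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results is treated as not passed.
    return all(test_status.get(name) == "PASSED" for name in required)


# Illustrative values only; real test names come from the dataset sample.
status = {"test_fix": "PASSED", "test_existing": "PASSED"}
print(is_resolved(status, ["test_fix"], ["test_existing"]))
```

Treating a missing test as a failure is a deliberately conservative choice in this sketch: a patch that prevents a required test from running should not be counted as a solution.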