SWE-PRBench contains 350 pull requests annotated with human reviewer feedback for evaluating AI code review quality. Created by foundry-ai, the benchmark measures whether large language models can identify the same issues that human reviewers flagged in production code changes.
Use Cases
- Benchmark large language models on their ability to identify issues in pull requests using the human-annotated ground truth.
- Analyze patterns in human reviewer feedback across the 350 pull requests to understand common review criteria.
- Train models for automated code review by using the pull request examples and corresponding human feedback as supervision.
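The benchmarking use case above can be sketched as a per-pull-request comparison between the issues a model flags and the issues human reviewers flagged. This is a minimal sketch under stated assumptions, not the benchmark's own scoring method: the dataset's column structure is not documented here, so the `Issue` representation (a file path plus line number), the `score_review` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical representation: an issue is identified by (file path, line number).
# The real dataset schema is not documented, so this is an assumption.
Issue = tuple[str, int]


@dataclass
class ReviewScore:
    precision: float  # share of model-flagged issues that humans also flagged
    recall: float     # share of human-flagged issues that the model also flagged


def score_review(human: set[Issue], model: set[Issue]) -> ReviewScore:
    """Compare model-flagged issues with human-flagged issues for one pull request."""
    matched = len(human & model)
    precision = matched / len(model) if model else 0.0
    recall = matched / len(human) if human else 0.0
    return ReviewScore(precision, recall)


# Made-up example values for illustration only.
human_issues = {("app/db.py", 42), ("app/api.py", 10)}
model_issues = {("app/db.py", 42), ("app/util.py", 7)}
score = score_review(human_issues, model_issues)
print(score)
```

Exact matching on file and line is the simplest possible criterion. Free-text reviewer comments would likely need a looser matching step (for example, manual or model-assisted judging) before a comparison like this applies.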
Strengths
- Provides 350 real-world pull requests with human-annotated ground truth for evaluation.
- Focuses on evaluating code changes, a task distinct from the code generation that other benchmarks measure.
- Dataset was last updated in March 2026, indicating recent maintenance.
Limitations
- The dataset size of 350 pull requests is relatively small for training large-scale machine learning models.
- The column structure, sample data, and file formats are not documented here, which limits immediate usability.
- Results may be biased toward the types of projects and programming languages represented in the collected pull requests.
Provenance
- Source: foundry-ai on Hugging Face
- Collection Method: Collection of 350 pull requests with human-annotated reviewer feedback.
- Time Range: Not specified
- Freshness: Last updated 2026-03-24.
- Geography: Not specified