Name: Pull Request Review Benchmark With 350 Human-Annotated Examples
Creator: foundry-ai
Published: 2026-03-24T19:41:39
Keywords: Library: polars, Language: en, Size Categories: n<1K, Modality: text, Modality: tabular, Library: mlcroissant, Software Engineering, Library: datasets, Benchmark, Library: pandas, LLM Evaluation, License: CC BY 4.0, Code Review, Region: us, Pull Requests, Task Categories: Text Classification, JSON

Description

SWE-PRBench contains 350 pull requests annotated with human reviewer feedback for evaluating the quality of AI code review. Created by foundry-ai, the benchmark measures whether large language models can identify the same issues that human reviewers flagged in production code changes.

Use Cases

Benchmark large language models on their ability to identify issues in pull requests using the human-annotated ground truth.
Analyze patterns in human reviewer feedback across the 350 pull requests to understand common review criteria.
Train models for automated code review by using the pull request examples and corresponding human feedback as supervision.
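
The benchmarking use case above can be sketched as a per-pull-request comparison between the issues a model reports and the issues human reviewers flagged. This is a minimal illustration under stated assumptions, not the benchmark's own scoring method: the function name, the ReviewScore type, and the issue labels are hypothetical, and it assumes issues have already been normalized to comparable labels. The dataset's actual schema is not documented in this card, and real reviewer feedback is likely free text that would need a separate matching step.

```python
from dataclasses import dataclass


@dataclass
class ReviewScore:
    """Agreement between model-reported and human-flagged issues for one PR."""
    precision: float
    recall: float
    f1: float


def score_review(human_issues: set[str], model_issues: set[str]) -> ReviewScore:
    """Score one pull request by exact label overlap (a simplifying assumption)."""
    matched = human_issues & model_issues
    # Precision: share of model-reported issues that humans also flagged.
    precision = len(matched) / len(model_issues) if model_issues else 0.0
    # Recall: share of human-flagged issues that the model also reported.
    recall = len(matched) / len(human_issues) if human_issues else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return ReviewScore(precision=precision, recall=recall, f1=f1)


if __name__ == "__main__":
    # Hypothetical labels for a single pull request, for illustration only.
    human = {"missing-null-check", "unbounded-retry"}
    model = {"missing-null-check", "naming-style"}
    print(score_review(human, model))
```

Averaging these per-pull-request scores over all 350 examples would give one benchmark-level number per model; whether to weight pull requests equally or by issue count is a design choice left open here.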

Strengths

Provides 350 real-world pull requests with human-annotated ground truth for evaluation.
Focuses on a distinct task, evaluating code changes, rather than the code generation that other benchmarks measure.
The dataset was last updated in March 2026, indicating recent maintenance.

Limitations

The dataset size of 350 pull requests is relatively small for training large-scale machine learning models.
The column structure, sample data, and file formats are not documented here, which limits immediate usability.
The dataset may be biased toward the types of projects and programming languages represented in the collected pull requests.

Provenance

Source: foundry-ai on Hugging Face
Collection Method: 350 pull requests gathered together with human-annotated reviewer feedback.
Time Range: Not specified.
Freshness: Last updated 2026-03-24.
Geography: Not specified.
