Github Codereview is a large-scale dataset containing high-quality human-written code reviews sourced from top GitHub repositories. It captures the interaction between inline reviewer comments on pull requests and the subsequent code modifications made by authors. The dataset is designed to provide a natural signal for training models to understand code quality and the iterative review process.
Use Cases
- Training models to generate automated code review comments
- Developing automated code refinement and bug-fixing tools
- Training classifiers to identify code that likely requires human intervention
- Researching software engineering patterns in pull request workflows
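As a rough illustration of the classifier use case above, the sketch below turns review records into binary labels: 1 when a reviewer left a comment, 0 when the code passed review. Because the dataset's schema is not documented here, the `ReviewRecord` class, its field names, and the sample rows are all hypothetical stand-ins, not the dataset's actual columns or contents.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReviewRecord:
    # Hypothetical fields; the dataset's real column names are not documented.
    code_before: str
    reviewer_comment: Optional[str]  # None when the change passed review
    code_after: Optional[str]        # the author's revision, if any


def needs_review_label(record: ReviewRecord) -> int:
    """Return 1 if a reviewer left a non-empty comment, else 0."""
    comment = record.reviewer_comment
    return int(bool(comment and comment.strip()))


def to_classifier_examples(records: List[ReviewRecord]) -> List[Tuple[str, int]]:
    """Map records to (code, label) pairs for a binary classifier."""
    return [(r.code_before, needs_review_label(r)) for r in records]


# Made-up rows for illustration only.
sample = [
    ReviewRecord("x = eval(user_input)", "Avoid eval on untrusted input.", "x = int(user_input)"),
    ReviewRecord("total = sum(values)", None, None),
]
examples = to_classifier_examples(sample)
```

The same records could feed a comment-generation or refinement model by pairing `code_before` with `reviewer_comment` or `code_after` instead of a label.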
Strengths
- Includes negative examples (code that passed review), which can help reduce false positives in trained models
- Sourced from 'top' GitHub repositories, implying a high standard of human review
- Captures the direct relationship between natural language feedback and code refinement
Limitations
- Specific schema and column descriptions are currently unknown
- The license is categorized as 'other', which may require manual verification for specific use cases
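Since the column descriptions are unknown, a first step for any user is likely to be inspecting a handful of records directly. The sketch below is one minimal way to do that with plain Python dictionaries; the `summarize_schema` helper and the sample rows (including the `comment` and `patch` keys) are assumptions made for illustration and do not reflect the dataset's real fields.

```python
from collections import defaultdict


def summarize_schema(records):
    """Collect the Python type names observed for each key across records."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    # Sort type names so the summary is stable between runs.
    return {key: sorted(types) for key, types in seen.items()}


# Made-up rows standing in for real dataset records.
rows = [
    {"comment": "Rename this variable.", "patch": "@@ -1 +1 @@"},
    {"comment": None, "patch": "@@ -3 +3 @@"},
]
schema = summarize_schema(rows)
```

A key that reports both `NoneType` and `str` hints at an optional field, which is worth confirming before building a training pipeline on it.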
Provenance
- Source
  - GitHub pull requests from top-tier repositories.
- Collection Method
  - Extracted from GitHub's public pull request and inline comment data.
- Freshness
  - Last updated on March 10, 2026.
- Geography
  - Global (GitHub repositories).