Github Codereview

Name: Github Codereview
Creator: ronantakizawa
Published: 2026-03-03T18:27:45
Keywords: Language: code, en; Task category: text generation; License: other; Libraries: polars, dask, mlcroissant, datasets; Modalities: text, tabular; Size category: 100K < n < 1M; Format: Parquet; Region: US; Topics: GitHub, software engineering, code review, code generation, pull requests

Available on 1 platform

Description

Github Codereview is a large-scale dataset containing high-quality human-written code reviews sourced from top GitHub repositories. It captures the interaction between inline reviewer comments on pull requests and the subsequent code modifications made by authors. The dataset is designed to provide a natural signal for training models to understand code quality and the iterative review process.

Use Cases

Training models to generate automated code review comments
Developing automated code refinement and bug-fixing tools
Training classifiers to identify code that likely requires human intervention
Researching software engineering patterns in pull request workflows

Strengths

Includes negative examples of code that passed review to reduce false positives in models
Sourced from 'top' GitHub repositories, implying a high standard of human review
Captures the direct relationship between natural language feedback and code refinement
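
As a rough illustration of how the two signals above (reviewed code paired with its follow-up change, and negative examples that passed review) might be turned into training data, here is a minimal Python sketch. The dataset's real schema is not documented, so the record type and every field name below (ReviewRecord, code_before, reviewer_comment, code_after) are hypothetical placeholders, not the dataset's actual columns.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class ReviewRecord:
    # Hypothetical fields: the real column names are not documented.
    code_before: str
    reviewer_comment: Optional[str]  # None for code that passed review
    code_after: Optional[str]        # None when no follow-up change exists


def to_classifier_example(record: ReviewRecord) -> Dict[str, object]:
    """Label a snippet 1 if it drew a reviewer comment, 0 if it passed review."""
    label = int(record.reviewer_comment is not None)
    return {"text": record.code_before, "label": label}


def to_refinement_pairs(records: Iterable[ReviewRecord]) -> List[Dict[str, str]]:
    """Keep only records that have both a comment and a subsequent change."""
    pairs = []
    for r in records:
        if r.reviewer_comment is not None and r.code_after is not None:
            prompt = f"{r.code_before}\n# Review: {r.reviewer_comment}"
            pairs.append({"input": prompt, "target": r.code_after})
    return pairs
```

Under these assumptions, the classifier examples would serve the "requires human intervention" use case, while the refinement pairs would serve comment generation or code refinement. The field mapping would need to be adjusted once the actual columns are known.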

Limitations

Specific schema and column descriptions are currently unknown
The license is categorized as 'other', which may require manual verification for specific use cases

Provenance

Source: GitHub pull requests from top-tier repositories.
Collection Method: Extracted from GitHub's public pull request and inline comment data.
Freshness: Last updated on March 10, 2026.
Geography: Global (GitHub repositories)

Quality Score

D40

Description: 42
Source: 36
Reputation: 55
Access: 22

Community

1.0K downloads

50 likes

0 views

Dataset Info

Author: ronantakizawa
Created: Mar 3, 2026
Updated: Mar 10, 2026
Last synced: Jun 23, 2026
