LAURA: 301k Code Review Entries from 1,807 GitHub Projects
by Yuxin Zhang · Updated 1mo ago
6.7 GB · 14 files
Available on 1 platform
Description
301,256 entries of code review data from 1,807 high-quality GitHub projects in C, C++, Java, and Python. This dataset, created by Yuxin Zhang and released in April 2026, supports research on retrieval-augmented generation for automated code review. It includes a manually annotated evaluation subset of 384 entries and a time-split retrieval database.
Use Cases
Training or evaluating retrieval-augmented LLMs for code review generation based on the provided review exemplars and code diffs.
Benchmarking the performance of LLMs like ChatGPT-4o and DeepSeek v3 on code review tasks using the human-annotated evaluation set.
Studying code review patterns and comment quality across four major programming languages (C, C++, Java, Python).
Fine-tuning specialized models for automated code review, as the original dataset description demonstrates with the CodeReviewer model.
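As a rough illustration of the retrieval-augmented use case, the sketch below picks the most similar past review for a new diff and assembles a prompt from it. The field names `diff` and `comment`, the sample records, and the prompt wording are all assumptions made for illustration, since the dataset's actual columns are undocumented. Token-level Jaccard similarity is used here only as a simple stand-in for whatever retriever a real system would use.

```python
import re

def tokens(text):
    """Lowercased set of identifier-like tokens in a piece of text."""
    return set(re.findall(r"[A-Za-z_]\w*", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(query_diff, database, k=1):
    """Return the k database entries whose diffs are most similar to the query."""
    q = tokens(query_diff)
    ranked = sorted(database, key=lambda e: jaccard(q, tokens(e["diff"])), reverse=True)
    return ranked[:k]

def build_prompt(query_diff, exemplars):
    """Assemble a plain-text prompt from retrieved exemplars and the new diff."""
    parts = []
    for ex in exemplars:
        parts.append(f"Example diff:\n{ex['diff']}\nExample review:\n{ex['comment']}\n")
    parts.append(f"New diff:\n{query_diff}\nReview:")
    return "\n".join(parts)

# Hypothetical records; real entries and field names may differ.
database = [
    {"diff": "+ if (ptr == NULL) return -1;", "comment": "Consider logging before returning."},
    {"diff": "+ for i in range(len(items)): print(items[i])", "comment": "Iterate over items directly."},
]
query = "+ for j in range(len(rows)): print(rows[j])"
top = retrieve(query, database, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

The resulting prompt string would then be passed to an LLM of choice; that step is omitted because it depends on the model and client library used.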
Strengths
Contains 301,256 entries sourced from 1,807 high-quality GitHub projects.
Includes a manually annotated evaluation dataset of 384 entries for reliable benchmarking.
Provides a time-split retrieval database of 298,494 entries, enabling realistic temporal evaluation splits.
Covers four major programming languages: C, C++, Java, and Python.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row counts for individual component files (e.g., evaluation_data.csv) are not listed, which may limit suitability assessment.
The description gives only the split date (December 26, 2024), not the start or end of the overall GitHub collection window, making temporal bias difficult to evaluate.
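Because column documentation and per-file row counts are missing, a first step after download is likely to be a quick inspection of each CSV. The sketch below reads the header and counts rows using only the standard library. The in-memory sample and its column names (`diff`, `comment`) are hypothetical stand-ins; the real files may use different columns.

```python
import csv
import io

def summarize_csv(fileobj):
    """Return (column_names, row_count) for a CSV file object with a header row."""
    reader = csv.DictReader(fileobj)
    row_count = sum(1 for _ in reader)  # iterating also populates reader.fieldnames
    return reader.fieldnames, row_count

# Hypothetical stand-in for a downloaded file; real column names may differ.
sample = io.StringIO("diff,comment\n+ x = 1,Use a constant\n+ y = 2,Add a test\n")
columns, n_rows = summarize_csv(sample)
print(columns, n_rows)
```

For a real file, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and should be checked against the actual files.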
Provenance
Source
GitHub, collected via the GitHub GraphQL API.
Collection Method
Data was collected, filtered with rule-based and LLM-based methods, and processed with provided Python scripts.
Time Range
The retrieval database contains entries prior to December 26, 2024; the evaluation set contains entries later than that date.
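The time split described above can be sketched as follows. The `created_at` field and the sample entries are hypothetical. The description does not say how entries dated exactly December 26, 2024 are handled, so this sketch assigns them to neither side.

```python
from datetime import date

SPLIT_DATE = date(2024, 12, 26)  # cutoff stated in the dataset description

def split_by_date(entries, cutoff=SPLIT_DATE):
    """Split entries into (retrieval, evaluation) around the cutoff date.

    Entries strictly before the cutoff go to the retrieval side and entries
    strictly after it go to the evaluation side. Entries on the cutoff date
    are left out, since the description does not say where they belong.
    """
    retrieval = [e for e in entries if e["created_at"] < cutoff]
    evaluation = [e for e in entries if e["created_at"] > cutoff]
    return retrieval, evaluation

# Hypothetical entries; the real date field name and format are undocumented.
entries = [
    {"id": 1, "created_at": date.fromisoformat("2024-11-30")},
    {"id": 2, "created_at": date.fromisoformat("2025-01-15")},
    {"id": 3, "created_at": date.fromisoformat("2024-12-26")},
]
retrieval_db, evaluation_set = split_by_date(entries)
print([e["id"] for e in retrieval_db], [e["id"] for e in evaluation_set])
```

Keeping the retrieval side strictly earlier than the evaluation side prevents a retriever from surfacing reviews written after the change being evaluated.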
Freshness
Last updated 2026-04-27 13:28:58; freshness should be verified.
Geography
Not specified.
The dataset includes Excel files. Longer comments in the evaluation results file may be partially truncated due to Excel's 8192-character limit.
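One way to spot comments that may have been cut off is to flag any whose length reaches the stated limit. This is a minimal sketch: the 8192 threshold is taken from the note above and has not been independently verified, and the sample comments are made up.

```python
EXCEL_LIMIT = 8192  # limit as stated in the note above; not independently verified

def possibly_truncated(comments, limit=EXCEL_LIMIT):
    """Return indices of comments whose length reaches or exceeds the limit."""
    return [i for i, c in enumerate(comments) if len(c) >= limit]

# Made-up sample comments; real ones would come from the evaluation results file.
comments = ["short comment", "x" * 8192, "y" * 100]
flagged = possibly_truncated(comments)
print(flagged)
```

A flagged comment is only a candidate for truncation; comparing it against the corresponding entry in a non-Excel file, if one exists, would be needed to confirm.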