A benchmark for commit message generation that pairs code changes with English commit messages across six programming languages: Java, Python, Go, JavaScript, PHP, and Ruby. It is built from GitHub repositories with permissive licenses to support reproducibility and legal compliance.
Use Cases
- Train machine learning models to generate commit messages from paired code diffs and English commit messages
- Benchmark the zero-shot performance of LLMs on commit summarization across the six supported programming languages
- Analyze developer documentation patterns by comparing commit messages against code changes in Java, Python, and Go
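
The benchmarking use case above can be sketched as a per-language scoring loop. This is a minimal illustration, not the benchmark's official evaluation: the record fields (`language`, `diff`, `message`), the toy records, the `baseline_generate` function, and the choice of unigram F1 as the metric are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two messages (a simple stand-in metric)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score_by_language(records, generate):
    """Average the metric per language; `generate` maps a diff to a message."""
    scores = defaultdict(list)
    for record in records:
        prediction = generate(record["diff"])
        scores[record["language"]].append(
            unigram_f1(prediction, record["message"])
        )
    return {lang: sum(vals) / len(vals) for lang, vals in scores.items()}


# Hypothetical records; the real field names and contents may differ.
records = [
    {"language": "Python", "diff": "-x = 1\n+x = 2",
     "message": "Update x default"},
    {"language": "Go", "diff": "+func Add(a, b int) int { return a + b }",
     "message": "Add helper function"},
]


def baseline_generate(diff: str) -> str:
    # Placeholder for a model call (for example, a zero-shot LLM prompt).
    return "Update code"


per_language = score_by_language(records, baseline_generate)
print(per_language)
```

Replacing `baseline_generate` with a real model call and `unigram_f1` with the metric of your choice keeps the per-language breakdown intact, which makes it easy to compare results across the six languages.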
Strengths
- Covers six programming languages: Java, Python, Go, JavaScript, PHP, and Ruby
- Restricts all natural language commit messages to English
- Sourced exclusively from GitHub repositories with licenses permitting redistribution
- Provides a reproducible benchmark specifically for the commit message generation task