HumanEvalPack extends OpenAI's HumanEval benchmark to cover six programming languages across three tasks. The dataset includes Python, JavaScript, Java, Go, C++, and Rust splits, with non-Python splits translated and cleaned by humans. It was created by the bigcode organization and updated on Hugging Face in August 2025.
Use Cases
- Benchmarking code generation models based on multi-language problem-solving tasks
- Evaluating AI performance on software engineering tasks across different programming languages
- Training models for multilingual code completion based on the translated problem sets
- Comparing model accuracy between Python and other languages using the cleaned translations
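For the cross-language comparison use case, the bookkeeping might be sketched as follows. This assumes you already have one pass/fail outcome per problem for each language. The `results` layout and both function names are hypothetical and are not part of the dataset or of any bigcode tooling.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Return the fraction of problems solved; 0.0 for an empty list."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


def gap_vs_baseline(
    results: Dict[str, List[bool]], baseline: str = "python"
) -> Dict[str, float]:
    """Return each language's pass rate minus the baseline language's pass rate."""
    if baseline not in results:
        raise KeyError(f"baseline language {baseline!r} not in results")
    base = pass_rate(results[baseline])
    return {
        lang: pass_rate(outcomes) - base
        for lang, outcomes in results.items()
        if lang != baseline
    }


# Hypothetical outcomes for illustration only, not real measurements.
example = {
    "python": [True, True, False, True],
    "rust": [True, False, False, True],
}
gaps = gap_vs_baseline(example)
```

A negative gap means the model solved a smaller share of problems in that language than in Python. A plain solve rate is used here for simplicity; a real evaluation may prefer a sampling-based metric.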
Strengths
- Extends a known benchmark (OpenAI's HumanEval) to six programming languages
- Includes human-translated and cleaned splits for languages beyond Python
- Created by the bigcode organization, whose work centers on code models and code-related datasets
Limitations
- Column-level documentation is absent; field semantics must be inferred after download
- Row count is not stated in the available metadata, which makes suitability harder to assess before download
- Description metadata is limited; actual data quality requires manual inspection after download
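Because column semantics and row count are undocumented, a first inspection pass after download could be sketched as below. `summarize_rows` is a hypothetical helper that accepts any iterable of dict-like records. It makes no assumption about the dataset's actual schema.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def summarize_rows(rows: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Count rows and, per column, tally the Python types seen and the empty values."""
    row_count = 0
    columns: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        row_count += 1
        for name, value in row.items():
            info = columns.setdefault(name, {"types": Counter(), "empty": 0})
            info["types"][type(value).__name__] += 1
            # Treat None and the empty string as "empty" for a rough quality check.
            if value is None or (isinstance(value, str) and value == ""):
                info["empty"] += 1
    return {"row_count": row_count, "columns": columns}
```

With the Hugging Face `datasets` library installed, the rows could come from a call such as `load_dataset("bigcode/humanevalpack", "python", split="test")`. The repository id, config name, and split in that call are assumptions to confirm against the dataset page before relying on them.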
Provenance
- Source: bigcode
- Collection Method: Extension of OpenAI's HumanEval with human translation and cleaning
- Freshness: Last updated 2025-08-19 20:35:51; freshness should be verified