DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

HumanEvalPack: Multi-Language Code Evaluation Benchmark | DataSalon

Software Engineering & Security

Name: HumanEvalPack: Multi-Language Code Evaluation Benchmark
Creator: bigcode
Published: 2023-03-29T12:00:16
Keywords: Benchmark, Code Evaluation, Text, Software Testing, Programming Languages

by bigcode · Updated 9mo ago

Available on 1 platform

Description

HumanEvalPack extends OpenAI's HumanEval benchmark to six programming languages and three tasks (code repair, code explanation, and code synthesis). It provides Python, JavaScript, Java, Go, C++, and Rust splits; the non-Python splits were translated and cleaned by humans. The dataset was created by the bigcode organization and last updated on Hugging Face on August 19, 2025.

Use Cases

Benchmarking code generation models on multi-language problem-solving tasks
Evaluating AI performance on software engineering tasks across the six covered programming languages
Training models for multilingual code completion on the translated problem sets
Comparing model accuracy between Python and the other five languages using the cleaned translations
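
HumanEval-style benchmarks are commonly scored with pass@k, the probability that at least one of k sampled solutions passes a problem's tests. This page does not state which metric HumanEvalPack users should report, so the following is only a minimal sketch of the standard unbiased pass@k estimator, applied per language. The result counts and the `hypothetical_results` name are made up for illustration and are not measurements from this dataset.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of solutions sampled, c: number that passed the tests.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over a list of (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical (n, c) counts per problem, grouped by language split.
hypothetical_results = {
    "python": [(10, 5), (10, 0), (10, 10)],
    "rust": [(10, 2), (10, 0), (10, 10)],
}

scores = {
    lang: mean_pass_at_k(results, k=1)
    for lang, results in hypothetical_results.items()
}
for lang, score in scores.items():
    print(f"{lang}: pass@1 = {score:.3f}")
```

Computing the score separately for each split keeps the Python baseline and the translated languages directly comparable, provided the same n and k are used for every language.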

Strengths

Extends a known benchmark (OpenAI's HumanEval) to six programming languages
Includes human-translated and cleaned splits for languages beyond Python
Created by the bigcode organization, which suggests a focus on code-related datasets

Limitations

Column-level documentation is absent; field semantics must be inferred after download
Row count is unknown, which may limit suitability assessment
Description metadata is limited; actual data quality requires manual inspection after download
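
Because the column documentation and row count are missing, a first inspection pass after download might look like the sketch below. It assumes the records have already been loaded as a list of dictionaries. The field names in `sample_records` are placeholders invented for illustration, not the dataset's actual schema.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted Python type names observed for it."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


# Hypothetical records; real field names must be read from the downloaded data.
sample_records = [
    {"task_id": "example/0", "prompt": "def add(a, b):", "tests": ["assert add(1, 2) == 3"]},
    {"task_id": "example/1", "prompt": "def neg(a):", "tests": None},
]

row_count = len(sample_records)
summary = summarize_fields(sample_records)

print(f"rows: {row_count}")
for field, types in summary.items():
    print(f"{field}: {', '.join(types)}")
```

With the Hugging Face `datasets` library, the records could come from iterating over a split returned by `load_dataset`. The repository ID and configuration names are not given on this page and would need to be taken from the dataset's Hugging Face page.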

Provenance

Source: bigcode
Collection Method: Extension of OpenAI's HumanEval with human translation and cleaning
Freshness: Last updated 2025-08-19 20:35:51; freshness should be verified

Quality Score

C43

Description

Source

Reputation

Access

Community

4.7K downloads

92 likes

0 views

Dataset Info

Author: bigcode
Created: Mar 29, 2023
Updated: Aug 19, 2025
Last synced: Jun 7, 2026
