Stack Final: Filtered Multi-Language Source Code Dataset

Name: Stack Final: Filtered Multi-Language Source Code Dataset
Creator: MaLA-LM
Published: 2024-04-29T10:11:02
Keywords: Source Code, Text, Filtered Dataset, Programming Languages

Description

A filtered subset of the bigcode/the-stack-dedup dataset containing source code files in 41 programming languages, including Ada, Assembly, C, C++, Python, Java, and Rust. It was created by MaLA-LM and last updated on Hugging Face in April 2024. The dataset's exact size, row count, and column structure are not specified in the provided metadata.

Use Cases

Train code generation models on multi-language source code.
Benchmark language model performance on individual programming languages, such as Python, Java, or Rust.
Fine-tune models for code summarization or documentation using the included Markdown and comment data.
Study coding patterns and idioms across different programming language families.

Strengths

Focuses on 41 specific programming languages, enabling targeted language-specific analysis.
Derived from the established bigcode/the-stack-dedup dataset, suggesting a foundation of deduplicated source code.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
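
Because the columns are undocumented, a first inspection pass might look like the sketch below. It assumes the downloaded rows can be read as Python dictionaries; the field names `content`, `lang`, and `size` in the sample are hypothetical placeholders, not the dataset's confirmed schema.

```python
from collections import defaultdict


def summarize_columns(rows):
    """Count rows and record, per column, the value types seen and non-null count."""
    summary = defaultdict(lambda: {"types": set(), "non_null": 0})
    total = 0
    for row in rows:
        total += 1
        for key, value in row.items():
            column = summary[key]
            column["types"].add(type(value).__name__)
            if value is not None:
                column["non_null"] += 1
    return total, dict(summary)


# Hypothetical sample rows; the real column names are not documented.
sample_rows = [
    {"content": "print('hi')", "lang": "Python", "size": 11},
    {"content": "fn main() {}", "lang": "Rust", "size": 12},
    {"content": None, "lang": "C", "size": 0},
]

total, summary = summarize_columns(sample_rows)
for name, info in sorted(summary.items()):
    print(name, sorted(info["types"]), f"{info['non_null']}/{total} non-null")
```

Running the same function over a small sample of real rows gives a quick view of which fields exist, what types they hold, and how often they are empty, and it also yields the row count for that sample.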

Provenance

Source: Filtered from bigcode/the-stack-dedup dataset.
Collection Method: Subsetting based on specified programming language splits.
Freshness: Last updated 2024-04-29 10:46:27; freshness should be verified.

License: Unknown, which may restrict commercial or research use.
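
The collection method described under Provenance (subsetting by programming language) can be sketched as follows. This is an illustration only, not MaLA-LM's actual pipeline: the `lang` field name, the `filter_by_language` helper, and the sample records are all assumptions made for the example.

```python
from collections import Counter


def filter_by_language(records, keep_languages, lang_field="lang"):
    """Yield only the records whose language label is in keep_languages.

    Matching is case-insensitive. `lang_field` is a hypothetical column name.
    """
    keep = {name.lower() for name in keep_languages}
    for record in records:
        if str(record.get(lang_field, "")).lower() in keep:
            yield record


# Hypothetical records standing in for rows of the upstream dataset.
records = [
    {"lang": "Python", "content": "print('hi')"},
    {"lang": "Rust", "content": "fn main() {}"},
    {"lang": "COBOL", "content": "DISPLAY 'HI'."},
    {"lang": "python", "content": "x = 1"},
]

subset = list(filter_by_language(records, ["Python", "Rust"]))
counts = Counter(record["lang"].lower() for record in subset)
print(len(subset), dict(counts))
```

Counting the kept records per language, as in the last two lines, is a cheap way to check that a subset covers the languages you expect before training or benchmarking on it.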

Quality Score

D33

Description: 39
Source: 36
Reputation: 22
Access: 26

Community

1.2K downloads
1 like
0 views

Dataset Info

Author: MaLA-LM
Created: Apr 29, 2024
Updated: Apr 29, 2024
Last synced: May 17, 2026
