The Stack V2 Dedup: Near-Deduplicated Source Code from 600+ Languages

Name: The Stack V2 Dedup: Near-Deduplicated Source Code from 600+ Languages
Creator: bigcode
Published: 2024-02-26T09:58:00
Keywords: language:code, task_categories:text-generation, language_creators:expert-generated, language_creators:crowdsourced, license:other, library:polars, library:dask, library:datasets, library:mlcroissant, modality:text, modality:tabular, size_categories:1B<n<10B, format:parquet, multilinguality:multilingual, region:us, arxiv:2402.19173, arxiv:2207.14157, arxiv:2107.03374

Description

The Stack V2 Dedup is a near-deduplicated collection of source code spanning 600+ programming languages, with between 1 billion and 10 billion records. Produced by BigCode and last updated in April 2024, it is a refined subset of the full Stack v2 dataset, intended for training large language models.

Use Cases

Training code-specific LLMs on multilingual source code (text modality)
Benchmarking deduplication efficiency across 600+ programming languages
Repository-level code analysis using the repository-grouped data organization

Strengths

Includes 600+ programming languages
Near-deduplicated to minimize data leakage and training bias
Scale of 1 billion to 10 billion records
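
To make "near-deduplicated" concrete, the following is a minimal sketch of near-duplicate filtering based on Jaccard similarity over token shingles. This page does not state which method BigCode used, so the function names, the 5-token shingle size, and the 0.7 threshold are all illustrative assumptions. The pairwise comparison shown here is quadratic; pipelines at this scale typically approximate it with MinHash and locality-sensitive hashing.

```python
import re


def shingles(source: str, k: int = 5) -> set:
    """Return the set of k-token shingles for one source file.

    Tokens are identifiers/numbers or single punctuation characters,
    so whitespace-only differences do not change the result.
    """
    tokens = re.findall(r"\w+|[^\w\s]", source)
    if len(tokens) < k:
        # Short files become a single shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_dedup(files: list, threshold: float = 0.7, k: int = 5) -> list:
    """Keep the first file of every group of near-duplicates.

    A file is dropped when its similarity to any already-kept file
    reaches the (assumed) threshold.
    """
    kept = []  # pairs of (original text, shingle set)
    for text in files:
        sig = shingles(text, k)
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((text, sig))
    return [text for text, _ in kept]
```

Under these assumptions, two copies of a file that differ only in trailing whitespace collapse to one record, while unrelated files are both retained.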

Limitations

Custom license (tagged 'license:other') may restrict commercial usage or redistribution
Massive scale (1B+ records) requires significant storage and high-performance computing resources

Provenance

Source: BigCode
Collection Method: Scraped from public software repositories
Freshness: Last updated April 2024.
Geography: Global

The data is distributed as Parquet files, which can be read with Polars, Dask, or Hugging Face Datasets. Users must adhere to the terms of the custom license (tagged 'license:other').
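
Given the size of the corpus, streaming records is usually more practical than downloading every Parquet file. The sketch below assumes the Hugging Face Datasets library and that access to the dataset has already been granted. The repository id "bigcode/the-stack-v2-dedup" and the "language" column name are assumptions not confirmed by this page, so check the dataset's schema before relying on them.

```python
from itertools import islice
from typing import Iterable

DATASET_ID = "bigcode/the-stack-v2-dedup"  # assumed repository id
LANGUAGE_COLUMN = "language"               # assumed column name


def stream_records(dataset_id: str = DATASET_ID) -> Iterable[dict]:
    """Lazily iterate over records instead of downloading the whole corpus."""
    # Imported here so the helper below works without the library installed.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(dataset_id, split="train", streaming=True)


def take_language(records: Iterable[dict], language: str, limit: int,
                  column: str = LANGUAGE_COLUMN) -> list:
    """Return up to `limit` records whose `column` equals `language`."""
    matches = (r for r in records if r.get(column) == language)
    return list(islice(matches, limit))
```

A typical call would be take_language(stream_records(), "Python", 100), which stops reading as soon as 100 matching records have been collected.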

Quality Score

Overall: D (37)
Description: 39
Source: 36
Reputation: 46
Access: 22

Community

3.5K downloads

122 likes

0 views

Dataset Info

Author: bigcode
Created: Feb 26, 2024
Updated: Apr 23, 2024
Last synced: Jun 7, 2026
