Available on 1 platform
The Stack v2 Dedup is a near-deduplicated collection of source code containing between 1 billion and 10 billion records across more than 600 programming languages. Produced by BigCode and last updated in April 2024, it is a refined subset of the full Stack v2 dataset, intended for training large language models.
The data is distributed as Parquet files, so reading it requires a Parquet-capable library such as Polars, Dask, or Hugging Face Datasets. Users must adhere to the dataset's license terms, which are listed as "other".
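As a rough illustration of the Parquet handling mentioned above, here is a minimal sketch using Polars. It assumes the Parquet shards have already been downloaded to a local directory; the function names `find_parquet_shards` and `preview_shards` are hypothetical helpers, not part of the dataset or of any library, and the sketch makes no assumption about the dataset's column names.

```python
from pathlib import Path


def find_parquet_shards(root):
    """Return every *.parquet file under `root`, sorted for a stable order."""
    return sorted(Path(root).rglob("*.parquet"))


def preview_shards(paths, n_rows=5):
    """Lazily scan the given shards with Polars and return the first rows.

    Assumes a reasonably recent Polars release in which scan_parquet
    accepts a list of file paths.
    """
    import polars as pl  # imported here so the file listing works without Polars

    lazy_frame = pl.scan_parquet([str(p) for p in paths])
    # head() is pushed into the lazy query, so only a few rows are materialized
    return lazy_frame.head(n_rows).collect()
```

Calling `preview_shards(find_parquet_shards("path/to/local/copy"))` would return a small DataFrame for inspection. Scanning lazily rather than reading eagerly matters at this scale, since the full dataset is far too large to load into memory at once.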