Name: DCLM Data 200M: Packed GPT-2 Token Sequences for Data-Constrained Pretraining
Creator: zhiwei555
Published: 2026-04-20T00:43:29
Keywords: Language Modeling, Pretraining, Text, Natural Language Processing, GPT-2 Tokens, Text Corpus

Description

A dataset snapshot of pre-tokenized sequences used in the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'. The data consists of packed GPT-2-tokenized sequences derived from the DCLM corpus, prepared for studying pretraining in data-constrained, compute-rich regimes. The snapshot was uploaded by author zhiwei555 to Hugging Face.
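As a rough illustration of what "packed" token sequences usually means, the sketch below concatenates tokenized documents, separated by an end-of-text token, and cuts the stream into fixed-length blocks. The snapshot's actual packing scheme (block length, separator handling, treatment of the trailing remainder) is not documented in this description, so `pack_sequences`, `block_size`, and the choice to drop the remainder are assumptions made for illustration. Only the end-of-text id 50256 is a property of the GPT-2 tokenizer itself.

```python
from typing import Iterable, List

GPT2_EOS_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary


def pack_sequences(docs: Iterable[List[int]], block_size: int,
                   eos_id: int = GPT2_EOS_ID) -> List[List[int]]:
    """Concatenate tokenized documents, separated by EOS, into fixed-length blocks.

    Hypothetical helper: the dataset's real packing procedure may differ.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    # Build one long token stream, appending EOS after every document.
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)

    # Keep only full blocks; the trailing partial block is dropped (an assumption).
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Toy token ids, not real GPT-2 encodings of any text.
docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
blocks = pack_sequences(docs, block_size=4)
```

With packing of this kind, a single block can span a document boundary, so a training setup may need to decide whether attention is allowed to cross the separator token.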

Use Cases

Training language models in data-constrained settings directly from pre-tokenized sequences.
Studying pretraining scaling laws on DCLM-derived data.
Benchmarking regularization methods for language models on packed token data.
Analyzing how data constraints affect model performance on this corpus.

Strengths

Dataset is directly linked to a published research paper, providing academic context.
Data is pre-tokenized and packed, which may reduce preprocessing overhead for specific research setups.
The source corpus (DCLM) and the intended purpose (research on data-constrained pretraining) are stated explicitly.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count, file size, and specific file formats are unknown, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.
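Because the column layout is undocumented, a first inspection pass after download might look like the sketch below. It assumes the records have already been loaded as Python dictionaries by whatever reader fits the actual file format, which is unknown here. The helper `summarize_records` and the sample field names `input_ids` and `doc_id` are hypothetical and are not taken from the dataset. The only fixed fact used is the GPT-2 vocabulary size of 50257.

```python
from typing import Any, Dict, Iterable

GPT2_VOCAB_SIZE = 50257  # GPT-2 BPE vocabulary size, including <|endoftext|>


def summarize_records(records: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Collect, per field, the observed value types, the sequence lengths,
    and whether every value looks like a list of valid GPT-2 token ids."""
    summary: Dict[str, Dict[str, Any]] = {}
    for rec in records:
        for name, value in rec.items():
            info = summary.setdefault(
                name, {"types": set(), "lengths": set(), "in_vocab": True}
            )
            info["types"].add(type(value).__name__)
            if isinstance(value, (list, tuple)):
                info["lengths"].add(len(value))
                # A token field should hold integers in [0, GPT2_VOCAB_SIZE).
                if not all(isinstance(t, int) and 0 <= t < GPT2_VOCAB_SIZE
                           for t in value):
                    info["in_vocab"] = False
            else:
                # Scalars and strings cannot be a packed token sequence.
                info["in_vocab"] = False
    return summary


# Made-up sample records; the real field names must be read from the files.
sample = [
    {"input_ids": [1, 2, 50256], "doc_id": 0},
    {"input_ids": [5, 6, 7], "doc_id": 1},
]
report = summarize_records(sample)
```

A field with a single observed length and `in_vocab` set to `True` is a plausible candidate for the packed token column, but that reading should still be checked against the paper or the uploader's notes.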

Provenance

Source: Derived from the DCLM (DataComp for Language Models) corpus.
Collection Method: Prepared as a snapshot for research on data-constrained pretraining.
Freshness: Last updated 2026-06-08 18:45:32; freshness should be verified.

License is unknown; users should verify terms before use.
