Name: DCLM Data 400M: Pre-Tokenized Sequences for Data-Constrained Language Model Pretraining
Creator: zhiwei555
Published: 2026-04-20T00:47:55
Keywords: Language Model Pretraining, NLP Research, Text, GPT-2 Tokenized, Data-Constrained

Description

Pre-tokenized data snapshots used to study language model scaling laws in the data-constrained, compute-rich regime. The dataset consists of packed GPT-2-tokenized sequences stored in .pt files, as referenced in the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'. It was uploaded by user zhiwei555 to Hugging Face and last updated on June 8, 2026.
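"Packed" sequences usually means that tokenized documents are concatenated into one stream and cut into fixed-length chunks. The sketch below illustrates that common technique only; it is not confirmed to be the procedure used for this dataset. The helper name pack_sequences, the seq_len value, and the choice to drop the trailing partial chunk are assumptions made for illustration. The one GPT-2-specific fact used is the end-of-text token id, 50256.

```python
from typing import Iterable, List

GPT2_EOT_ID = 50256  # id of "<|endoftext|>" in the GPT-2 vocabulary


def pack_sequences(docs: Iterable[List[int]], seq_len: int,
                   eot_id: int = GPT2_EOT_ID) -> List[List[int]]:
    """Concatenate tokenized documents, each followed by an end-of-text
    token, and split the stream into fixed-length sequences.

    Hypothetical helper: tokens that do not fill a final full sequence
    are dropped, which is an assumption, not a documented behavior.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eot_id)
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy token ids standing in for real GPT-2 output.
toy_docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
packed = pack_sequences(toy_docs, seq_len=4)
```

Note that with this scheme a single packed sequence can span a document boundary, so the end-of-text token is the only marker separating unrelated documents.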

Use Cases

Replicating the data-constrained pretraining experiments by following the methodology of the cited paper.
Studying multi-epoch training effects on language models using the provided pre-tokenized sequences.
Benchmarking model performance in compute-rich, data-limited regimes as outlined in the paper.
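Multi-epoch training over a fixed set of sequences can be sketched as repeated passes with a fresh shuffle per epoch. The example below is a minimal illustration under stated assumptions: the helper name multi_epoch_batches, the batch size, and per-epoch reshuffling are hypothetical choices, not details taken from the paper.

```python
import random
from typing import Iterator, List


def multi_epoch_batches(num_sequences: int, batch_size: int, num_epochs: int,
                        seed: int = 0) -> Iterator[List[int]]:
    """Yield batches of sequence indices, reshuffling the same fixed set
    of sequences at the start of every epoch (hypothetical helper)."""
    rng = random.Random(seed)
    indices = list(range(num_sequences))
    for _ in range(num_epochs):
        rng.shuffle(indices)
        for start in range(0, num_sequences, batch_size):
            # Slicing copies, so later shuffles do not alter yielded batches.
            yield indices[start:start + batch_size]


# 10 sequences, batches of 4, 3 epochs: the last batch of each epoch is short.
batches = list(multi_epoch_batches(num_sequences=10, batch_size=4, num_epochs=3))
```

Every sequence is seen exactly once per epoch, so the number of epochs directly controls how often each token is repeated.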

Strengths

Data is pre-tokenized with GPT-2's tokenizer, reducing preprocessing overhead.
Curated specifically for studying the data-constrained pretraining regime described in the cited paper.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count (number of sequences) is not reported, which makes it harder to assess suitability before download.
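Because the file layout is undocumented, a first step after download is to summarize whatever object a .pt file contains. The sketch below is one possible way to do that; the helper name describe and the sample structure are hypothetical, and nothing here reflects the dataset's actual fields. Tensor-like objects are detected by duck typing, so the helper itself runs without PyTorch.

```python
from typing import Any, List


def describe(obj: Any, name: str = "root", depth: int = 0,
             max_depth: int = 3) -> List[str]:
    """Return indented lines summarizing a nested object (hypothetical helper).

    Anything exposing .shape and .dtype is reported as tensor-like;
    for lists and tuples only the first element is expanded.
    """
    pad = "  " * depth
    if hasattr(obj, "shape") and hasattr(obj, "dtype"):
        return [f"{pad}{name}: tensor-like shape={tuple(obj.shape)} dtype={obj.dtype}"]
    if isinstance(obj, dict):
        lines = [f"{pad}{name}: dict with {len(obj)} keys"]
        if depth < max_depth:
            for key, value in obj.items():
                lines += describe(value, str(key), depth + 1, max_depth)
        return lines
    if isinstance(obj, (list, tuple)):
        lines = [f"{pad}{name}: {type(obj).__name__} of length {len(obj)}"]
        if obj and depth < max_depth:
            lines += describe(obj[0], "[0]", depth + 1, max_depth)
        return lines
    return [f"{pad}{name}: {type(obj).__name__}"]


# With PyTorch installed, a downloaded file could be inspected like this
# (the path is a placeholder, not an actual file name from the dataset):
#   import torch
#   obj = torch.load("path/to/file.pt", map_location="cpu")
#   print("\n".join(describe(obj)))
sample = {"tokens": [[1, 2, 3], [4, 5, 6]], "seq_len": 3}  # made-up structure
summary = describe(sample)
```

Since .pt files are pickle-based, files of unverified origin should be loaded with care.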

Provenance

Source: zhiwei555 on Hugging Face, associated with the cited research paper.
Collection Method: Created for the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'; method likely involves pre-tokenizing source text data.
Freshness: Last updated 2026-06-08 18:46:37; freshness should be verified.

License is unknown; users should verify permissions before use.
