Name: Dclm Data 100M: Pre-tokenized Sequences for Data-Constrained Language Model Training
Creator: zhiwei555
Published: 2026-04-20T00:49:49
Keywords: Language Model Pretraining, NLP Research, Tokenized Text, Text, Data Constrained

Description

This dataset consists of pre-tokenized .pt files containing packed, GPT-2-tokenized sequences for language model pretraining. It provides a 100-million-token training split and a validation split, prepared for research in data-constrained, compute-rich regimes. It was uploaded by zhiwei555 and is associated with the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'.
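
As an illustration of what "packed" sequences usually means, the sketch below concatenates tokenized documents, separates them with the GPT-2 end-of-text token (id 50256), and cuts the stream into fixed-length blocks. The packing scheme actually used for this dataset is not documented here; the function name `pack_sequences`, the block size, and the choice to drop the trailing partial block are all assumptions made for illustration.

```python
from typing import Iterable, List

GPT2_EOS_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary


def pack_sequences(docs: Iterable[List[int]], block_size: int,
                   eos_id: int = GPT2_EOS_ID) -> List[List[int]]:
    """Concatenate token-id lists, EOS-separated, into fixed-length blocks.

    Hypothetical helper: the trailing partial block is dropped, which is
    one common convention and not necessarily the one used by the dataset.
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Toy token ids, not real GPT-2 encodings of any text.
docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
blocks = pack_sequences(docs, block_size=4)
for block in blocks:
    print(block)
```

Packing in this way lets every training example have the same length without padding, at the cost of blocks that may span document boundaries.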

Use Cases

Training language models based on the provided pre-tokenized sequences.
Studying scaling laws and regularization techniques for data-constrained pretraining.
Benchmarking model performance in compute-rich but data-limited regimes.
Reproducing experiments from the associated research paper on data-constrained pretraining.
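
In a data-constrained setup, a fixed 100-million-token training split is typically revisited many times. A minimal sketch of that arithmetic follows; the step count, batch size, and sequence length are hypothetical values chosen for illustration and are not taken from the associated paper.

```python
def epochs_over_dataset(steps: int, batch_size: int, seq_len: int,
                        dataset_tokens: int = 100_000_000) -> float:
    """Return how many passes over the training split a run makes.

    tokens seen = steps * batch_size * seq_len; dividing by the size of
    the training split gives the number of epochs.
    """
    tokens_seen = steps * batch_size * seq_len
    return tokens_seen / dataset_tokens


# Hypothetical run: 10,000 steps, 64 sequences per batch, 1,024 tokens each.
epochs = epochs_over_dataset(steps=10_000, batch_size=64, seq_len=1024)
print(f"{epochs:.4f} epochs")
```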

Strengths

Contains a purpose-built 100-million-token training split.
Includes a dedicated validation split for model evaluation.
Data is pre-tokenized with the GPT-2 tokenizer, which can reduce preprocessing overhead.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.
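
Because the file layout is undocumented, a first step after download is to inspect what a .pt file actually contains. The sketch below is a hypothetical helper, `describe`, that summarizes nested containers and any tensor-like object exposing `shape` and `dtype`. It is shown on a plain Python stand-in so that it runs without PyTorch; the keys in the sample are invented and do not reflect the dataset's real fields.

```python
def describe(obj, name="root", depth=0, max_depth=3):
    """Return indented lines summarizing a nested object's structure."""
    indent = "  " * depth
    shape = getattr(obj, "shape", None)
    if shape is not None:
        # Tensor-like object (e.g. a torch.Tensor or NumPy array).
        dtype = getattr(obj, "dtype", "?")
        return [f"{indent}{name}: {type(obj).__name__} "
                f"shape={tuple(shape)} dtype={dtype}"]
    if isinstance(obj, dict):
        lines = [f"{indent}{name}: dict with {len(obj)} keys"]
        if depth < max_depth:
            for key, value in obj.items():
                lines += describe(value, str(key), depth + 1, max_depth)
        return lines
    if isinstance(obj, (list, tuple)):
        lines = [f"{indent}{name}: {type(obj).__name__} of length {len(obj)}"]
        if obj and depth < max_depth:
            # Only the first element is shown, assuming homogeneous items.
            lines += describe(obj[0], "[0]", depth + 1, max_depth)
        return lines
    return [f"{indent}{name}: {type(obj).__name__}"]


# Stand-in object with invented keys; replace with the loaded file contents.
sample = {"train": [[1, 2, 3]], "meta": "x"}
report = describe(sample)
print("\n".join(report))
```

With PyTorch installed, the same helper could be applied to the result of `torch.load(path, map_location="cpu")`, where `path` points at one of the downloaded files. Depending on the PyTorch version, `torch.load` may restrict which object types it will unpickle by default.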

Provenance

Source: Hugging Face dataset uploaded by zhiwei555.
Collection Method: Created for the research presented in the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'.
Freshness: Last updated 2026-06-08 18:47:54; check the source repository for later revisions before use.

License is unknown; users should verify terms of use before downloading.
