DCLM Data 300M: GPT-2-Tokenized Sequences for Data-Constrained Language Model Training

Name: DCLM Data 300M: GPT-2-Tokenized Sequences for Data-Constrained Language Model Training
Creator: zhiwei555
Published: 2026-05-20T23:09:32
Keywords: Language Model Pretraining, Tokenized Data, NLP Research, Text, Natural Language Processing, Text Corpus

Description

Pre-tokenized .pt files containing packed GPT-2-tokenized sequences derived from the DCLM corpus. The snapshots were curated by zhiwei555 for the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws' and were last updated on June 8, 2026.

Use Cases

Training language models with Masked-Input Regularization (MIR), following the methodology of the associated paper.
Studying scaling laws for language model pretraining under data constraints.
Benchmarking model performance on pre-tokenized sequences derived from the DCLM corpus.
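
As a rough illustration of the data-constrained setting, the sketch below computes how many passes over a fixed corpus a given training-token budget implies. It assumes that "300M" in the dataset name means roughly 300 million unique tokens, which this listing does not confirm. The function names, sequence length, and batch size are hypothetical and are not taken from the paper.

```python
import math


def passes_over_data(train_tokens: int, unique_tokens: int) -> float:
    """Number of epochs implied by training on train_tokens drawn from a
    fixed corpus of unique_tokens."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    return train_tokens / unique_tokens


def steps_per_epoch(unique_tokens: int, seq_len: int, batch_size: int) -> int:
    """Optimizer steps needed to see every token once, rounding up."""
    tokens_per_step = seq_len * batch_size
    return math.ceil(unique_tokens / tokens_per_step)


# Assumption: 300M unique tokens, inferred from the dataset name only.
UNIQUE_TOKENS = 300_000_000

epochs = passes_over_data(1_200_000_000, UNIQUE_TOKENS)
steps = steps_per_epoch(UNIQUE_TOKENS, seq_len=1024, batch_size=64)
print(f"epochs={epochs}, steps per epoch={steps}")
```

Once the actual token count is known from the downloaded files, it should replace the assumed value.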

Strengths

Sequences are already GPT-2-tokenized and packed, which likely reduces preprocessing overhead.
Dataset is explicitly curated for studying data-constrained language model pretraining, providing a focused resource.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which makes it hard to assess suitability before download.
Description metadata is limited; actual data quality requires manual inspection after download.
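
Because the field layout is undocumented, a first step after download is usually to summarize what a file contains. The helper below is a minimal sketch that walks any nested Python object and reports container types, lengths, and array shapes. The `describe` function and the file name in the comment are hypothetical; the actual structure of these .pt files is not known from this listing.

```python
def describe(obj, depth=0, max_depth=3):
    """Return a list of text lines summarizing a nested object's structure."""
    pad = "  " * depth
    shape = getattr(obj, "shape", None)
    if shape is not None:
        # Tensor-like or array-like objects expose .shape and usually .dtype.
        dtype = getattr(obj, "dtype", "?")
        return [f"{pad}{type(obj).__name__} shape={tuple(shape)} dtype={dtype}"]
    if isinstance(obj, dict):
        lines = [f"{pad}dict with {len(obj)} keys"]
        if depth < max_depth:
            for key, value in obj.items():
                lines.append(f"{pad}  key {key!r}:")
                lines.extend(describe(value, depth + 2, max_depth))
        return lines
    if isinstance(obj, (list, tuple)):
        lines = [f"{pad}{type(obj).__name__} of length {len(obj)}"]
        if obj and depth < max_depth:
            # Only the first element is inspected to keep the output short.
            lines.extend(describe(obj[0], depth + 1, max_depth))
        return lines
    return [f"{pad}{type(obj).__name__}: {obj!r}"]


# With PyTorch installed, a downloaded file could be inspected like this
# (the file name is a placeholder):
#   import torch
#   obj = torch.load("shard.pt", map_location="cpu")
#   print("\n".join(describe(obj)))
print("\n".join(describe({"tokens": [[1, 2, 3], [4, 5, 6]], "seq_len": 3})))
```

Note that `torch.load` unpickles data, so files from an unverified source should be loaded with care.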

Provenance

Source: Derived from the DCLM corpus.
Collection Method: Packed GPT-2-tokenized sequences, curated as snapshots for a research paper.
Freshness: Last updated 2026-06-08 18:46:18; freshness should be verified.
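
For readers unfamiliar with the term, "packed" sequences are commonly built by concatenating tokenized documents into one stream, separated by an end-of-text token, and cutting the stream into fixed-length blocks. The sketch below shows that general technique in plain Python. Whether this dataset uses a separator, which block length it uses, and how it treats the leftover tail are all assumptions here; only the GPT-2 end-of-text id (50256) is a known property of the tokenizer.

```python
from typing import Iterable, List

GPT2_EOS_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary


def pack_sequences(
    docs: Iterable[List[int]], seq_len: int, eos_id: int = GPT2_EOS_ID
) -> List[List[int]]:
    """Concatenate token lists with an EOS separator and split the stream
    into blocks of exactly seq_len tokens, dropping any incomplete tail."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy token ids stand in for real GPT-2 tokenizer output.
blocks = pack_sequences([[1, 2, 3], [4, 5], [6]], seq_len=4)
print(blocks)
```

Packing avoids padding tokens, at the cost of blocks that can span document boundaries.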

License is unknown; users should verify permissions before use.

Quality Score

Overall: D (39)
Description: 42
Source: 39
Reputation: 39
Access: 26

Community

19 downloads
1 like
0 views

Dataset Info

Author: zhiwei555
Created: May 20, 2026
Updated: Jun 8, 2026
Last synced: Jun 15, 2026
