Pre-tokenized .pt files containing packed, GPT-2-tokenized sequences derived from the DCLM corpus. The dataset snapshots were curated by zhiwei555 for the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws' and were last updated on June 8, 2026.
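To illustrate what "packed" sequences usually means, here is a minimal sketch of one common packing scheme: token-id lists are concatenated with an end-of-text separator and cut into fixed-length rows. The exact procedure and file layout used for this dataset are not documented here, so the function name `pack_sequences`, the `seq_len` value, and the choice to drop the trailing partial row are all assumptions made for illustration. The only fixed fact is that 50256 is the id of `<|endoftext|>` in the GPT-2 vocabulary.

```python
from typing import Iterable, List

GPT2_EOS_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary


def pack_sequences(
    docs: Iterable[List[int]],
    seq_len: int,
    eos_id: int = GPT2_EOS_ID,
) -> List[List[int]]:
    """Hypothetical helper: pack token-id lists into fixed-length rows.

    Documents are concatenated into one stream with eos_id appended after
    each document, then the stream is split into rows of exactly seq_len
    tokens. A trailing partial row is dropped (an assumption, not
    necessarily what this dataset does).
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)

    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy token ids standing in for GPT-2-tokenized documents.
docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
packed = pack_sequences(docs, seq_len=4)
```

With packing of this kind, a single row can span a document boundary, so training code typically relies on the separator token (or an attention mask) to tell documents apart.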
License is unknown; users should verify permissions before use.