This dataset is a snapshot of the pre-tokenized sequences used in the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'. The data consists of packed, GPT-2-tokenized sequences derived from the DCLM corpus, prepared for studying pretraining in data-constrained, compute-rich regimes. The snapshot was uploaded to Hugging Face by author zhiwei555.
License is unknown; users should verify terms before use.
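
For readers unfamiliar with the term, the sketch below illustrates what "packed" token sequences usually means: tokenized documents are concatenated into one stream, with an end-of-text token between them, and the stream is cut into fixed-length rows. This is a generic illustration, not the paper's preprocessing code. The function name `pack_sequences`, the toy token ids, the sequence length of 4, and the choice to drop the trailing partial row are all assumptions made for the example. The only fixed fact used is that GPT-2's `<|endoftext|>` token has id 50256.

```python
from typing import Iterable, List

GPT2_EOS_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary


def pack_sequences(
    docs: Iterable[List[int]],
    seq_len: int,
    eos_id: int = GPT2_EOS_ID,
) -> List[List[int]]:
    """Concatenate tokenized documents and cut the stream into fixed-length rows.

    Hypothetical helper: the dataset's actual packing procedure may differ.
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary

    # Keep only full rows; the trailing partial row is dropped (an assumption).
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy token ids standing in for real GPT-2 tokenizer output.
toy_docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
packed = pack_sequences(toy_docs, seq_len=4)
for row in packed:
    print(row)
```

Note that a packed row can span a document boundary, so a single training sequence may contain the end of one document and the start of the next.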