Pre-tokenized data snapshots for studying language model scaling laws in the data-constrained, compute-rich regime. The dataset consists of packed, GPT-2-tokenized sequences stored in .pt files, as referenced in the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'. It was uploaded to Hugging Face by user zhiwei555 and last updated on June 8, 2026.
License is unknown; users should verify permissions before use.
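For readers unfamiliar with "packed" sequences, the sketch below shows one common way to produce them. Tokenized documents are concatenated into a single stream with an end-of-text separator, and the stream is cut into fixed-length rows. This illustrates the general technique only. The dataset's actual packing scheme, sequence length, and .pt file layout are not stated here, so the function name `pack_sequences`, the toy token ids, and `seq_len=4` are all assumptions. The one known value is 50256, the id of GPT-2's `<|endoftext|>` token.

```python
from typing import Iterable, List

GPT2_EOT_ID = 50256  # id of GPT-2's <|endoftext|> token


def pack_sequences(
    docs: Iterable[List[int]],
    seq_len: int,
    eot_id: int = GPT2_EOT_ID,
) -> List[List[int]]:
    """Concatenate tokenized documents and cut them into fixed-length rows.

    Hypothetical helper for illustration; not the dataset authors' code.
    Trailing tokens that do not fill a complete row are dropped.
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eot_id)  # mark the document boundary

    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy token ids standing in for real GPT-2 tokenizer output.
docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
packed = pack_sequences(docs, seq_len=4)
for row in packed:
    print(row)
```

Note that a row can span a document boundary, which is what distinguishes packing from padding each document to a fixed length. Files with the .pt extension are conventionally read with PyTorch's `torch.load`, but the structure of the loaded object should be inspected rather than assumed.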