Pre-tokenized .pt files containing packed, GPT-2-tokenized sequences for language model pretraining. The dataset provides a 100-million-token training split and a validation split, prepared for research in data-constrained, compute-rich regimes. It was created by zhiwei555 and is associated with the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'.
License is unknown; users should verify terms of use before downloading.
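
For readers unfamiliar with the term, the sketch below illustrates what "packed" sequences commonly means: tokenized documents are concatenated into one stream, separated by an end-of-text token, and cut into fixed-length blocks. This is an assumption about the general technique, not a description of how this dataset was actually built. The function name `pack_sequences`, the block size, and the toy token IDs are hypothetical; the only fixed fact is that GPT-2's `<|endoftext|>` token has ID 50256.

```python
from typing import Iterable, List

GPT2_EOS_ID = 50256  # GPT-2's <|endoftext|> token id


def pack_sequences(
    docs: Iterable[List[int]],
    block_size: int,
    eos_id: int = GPT2_EOS_ID,
) -> List[List[int]]:
    """Concatenate tokenized documents and split them into fixed-length blocks.

    Hypothetical helper for illustration only. Each document is followed by
    eos_id; any trailing tokens that do not fill a whole block are dropped.
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)

    n_blocks = len(stream) // block_size
    return [
        stream[i * block_size:(i + 1) * block_size]
        for i in range(n_blocks)
    ]


# Toy token IDs standing in for real GPT-2 tokenizer output.
docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
blocks = pack_sequences(docs, block_size=4)
```

Files with a .pt extension are typically read with `torch.load`, but the layout of the stored object (tensor shape, dtype, sequence length) is not documented here, so inspect what is loaded before relying on any particular structure.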