Available on 1 platform
201 tokenized data shards for training language models, stored in .tokbin format. The data was uploaded by marissatech and last updated on June 12, 2026. Each shard contains sequences of token IDs from a vocabulary of 24,256 tokens.
The dataset is intended to be private. The custom .tokbin binary format uses uint16 little-endian length headers.
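The description gives only the header encoding, not the full record layout, so the following is a minimal reading sketch under stated assumptions rather than the uploader's actual loader. It assumes each record is one uint16 little-endian length header followed by that many uint16 little-endian token IDs, with records packed back to back until end of file. The name `read_sequences` and the payload encoding are hypothetical; uint16 is merely sufficient for a 24,256-token vocabulary.

```python
import io
import struct
from typing import BinaryIO, Iterator, List

VOCAB_SIZE = 24_256  # vocabulary size stated in the dataset description


def read_sequences(stream: BinaryIO) -> Iterator[List[int]]:
    """Yield token-ID sequences from a .tokbin stream.

    ASSUMED layout (not confirmed by the source): repeated records of
    [uint16 LE length][length x uint16 LE token IDs].
    """
    while True:
        header = stream.read(2)
        if not header:
            return  # clean end of stream
        if len(header) < 2:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("<H", header)

        payload = stream.read(2 * length)
        if len(payload) < 2 * length:
            raise ValueError("truncated sequence payload")
        tokens = list(struct.unpack(f"<{length}H", payload))

        # Sanity check against the documented vocabulary size.
        if any(token >= VOCAB_SIZE for token in tokens):
            raise ValueError("token ID outside vocabulary")
        yield tokens


if __name__ == "__main__":
    # Build a tiny in-memory shard in the assumed layout and read it back.
    demo = struct.pack("<H3H", 3, 10, 20, 30) + struct.pack("<H2H", 2, 5, 6)
    for sequence in read_sequences(io.BytesIO(demo)):
        print(sequence)
```

If the real shards use a different payload width or a file-level header, only the two `struct.unpack` calls and the byte counts would need to change. To read an actual shard, open it in binary mode and pass the file object to `read_sequences`.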