Sutra 10B contains 10,193,029 educational entries totaling over 10 billion tokens, released by codelion in March 2026. This synthetic pedagogical dataset is generated via the Sutra framework to provide structured, multi-domain content for pretraining small language models.
The dataset is provided in JSON format and is compatible with Polars, Dask, and the Hugging Face Datasets library for large-scale processing.
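Because the full dataset is unlikely to fit comfortably in memory, a streaming read is one reasonable way to explore it. The following is a minimal sketch using only the Python standard library. It assumes the shards are JSON Lines (one JSON object per line), which the description does not confirm, and the file name `sample.jsonl` and the fields `text` and `domain` are hypothetical placeholders rather than the dataset's documented schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSON Lines file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_field(path: Path, field: str) -> dict:
    """Count records per value of `field` without loading the whole file."""
    counts: dict = {}
    for record in iter_records(path):
        key = record.get(field, "unknown")
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sample standing in for a real shard; the field names
# ("text", "domain") are assumptions, not the dataset's actual schema.
sample_rows = [
    {"text": "Photosynthesis converts light into chemical energy.", "domain": "science"},
    {"text": "A prime number has exactly two divisors.", "domain": "math"},
    {"text": "Newton's second law relates force and acceleration.", "domain": "science"},
]

with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "sample.jsonl"
    shard.write_text(
        "\n".join(json.dumps(row) for row in sample_rows), encoding="utf-8"
    )
    domain_counts = count_by_field(shard, "domain")

print(domain_counts)
```

The generator keeps only one record in memory at a time, so the same pattern should scale to large shards. With the Hugging Face Datasets library, `load_dataset("json", data_files=..., streaming=True)` offers a comparable streaming read, provided the files are laid out as assumed here.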