Sutra 100M is a synthetic educational dataset of 70,435 entries totaling approximately 100 million tokens, released by codelion in March 2026. It uses the Sutra framework to produce structured pedagogical content designed for language model pretraining across multiple domains.
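As a rough sense of scale, the two figures quoted above imply an average entry length of about 1,400 tokens. The sketch below only works through that arithmetic. The constant and function names are illustrative, and the total is the rounded "approximately 100 million" figure, so the result is an estimate rather than a measured statistic of the dataset.

```python
# Figures quoted in the dataset description.
NUM_ENTRIES = 70_435
# "Approximately 100 million" tokens; the exact count is not stated,
# so this rounded value is an assumption for the estimate.
APPROX_TOTAL_TOKENS = 100_000_000


def mean_tokens_per_entry(total_tokens: int, num_entries: int) -> float:
    """Return the arithmetic mean number of tokens per entry."""
    if num_entries <= 0:
        raise ValueError("num_entries must be positive")
    return total_tokens / num_entries


avg = mean_tokens_per_entry(APPROX_TOTAL_TOKENS, NUM_ENTRIES)
print(f"about {avg:,.0f} tokens per entry on average")
# prints: about 1,420 tokens per entry on average
```

This is a mean only. It says nothing about how entry lengths are distributed, which would have to be checked against the data itself.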
The dataset is distributed under the Apache 2.0 license and is compatible with common data libraries including pandas, polars, and Hugging Face datasets.