Sutra 10B: 10 Billion Tokens of Synthetic Pedagogical Pretraining Data

Name: Sutra 10B: 10 Billion Tokens of Synthetic Pedagogical Pretraining Data
Creator: codelion
Published: 2026-02-25T13:54:59
Keywords: Task Categories: text-generation, Library: polars, Library: dask, 10B, Size Categories: 1M<n<10M, Language: en, Modality: text, Educational, Modality: tabular, Pedagogical, Library: mlcroissant, Library: datasets, Pretraining, Multi-Domain, Region: us, JSON, License: apache-2.0, Sutra, Synthetic

Description

Sutra 10B contains 10,193,029 educational entries totaling over 10 billion tokens. It was published by codelion on February 25, 2026 and last updated in March 2026. The dataset is synthetic: it is generated with the Sutra framework to provide structured, multi-domain pedagogical content for pretraining small language models.
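As a rough sense of scale, the entry count and token count above imply an average entry length. The sketch below treats 10 billion as a lower bound on the total (the description says "over 10 billion"), so the result is a floor, not an exact figure. The function name is hypothetical and chosen only for illustration.

```python
# Figures taken from the description above.
NUM_ENTRIES = 10_193_029
MIN_TOTAL_TOKENS = 10_000_000_000  # "over 10 billion", so this is a lower bound


def min_avg_tokens_per_entry(total_tokens: int, entries: int) -> float:
    """Return the average tokens per entry implied by the two totals."""
    return total_tokens / entries


avg = min_avg_tokens_per_entry(MIN_TOTAL_TOKENS, NUM_ENTRIES)
print(f"At least ~{avg:.0f} tokens per entry on average")
```

An average just under 1,000 tokens suggests that most entries would fit in a typical small-model context window, although the actual length distribution is not documented here.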

Use Cases

Pretraining small language models using the 10 billion tokens of pedagogical text
Fine-tuning models for educational text generation using the structured entry format
Benchmarking the performance of synthetic data against human-authored corpora in LLM training

Strengths

Contains 10,193,029 discrete educational entries
Scale of 10 billion tokens supports full pretraining of small language models
Apache 2.0 license permits commercial and research use
Multi-domain coverage across various educational subjects

Limitations

Synthetic nature may introduce artifacts or biases inherent to the Sutra generation framework
Lack of human-verified ground truth for pedagogical accuracy
Undocumented column structure requires initial exploratory data analysis
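Because the column structure is undocumented, a first pass might simply count which fields appear and which JSON types they hold. The following is a minimal sketch using only the Python standard library. It assumes the data is stored as newline-delimited JSON objects (the listing only says "JSON"), and the field names `text` and `domain` in the sample are hypothetical placeholders, not the dataset's real columns.

```python
import io
import json
from collections import Counter, defaultdict


def profile_fields(lines, limit=1000):
    """Count field names and the JSON value types seen in the first `limit` records."""
    seen = 0
    type_counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # assumes one JSON object per line
        for key, value in record.items():
            type_counts[key][type(value).__name__] += 1
        seen += 1
        if seen >= limit:
            break
    return seen, {key: dict(counts) for key, counts in type_counts.items()}


# Hypothetical two-record sample; the real field names are not documented.
sample = io.StringIO(
    '{"text": "Photosynthesis converts light into chemical energy.", "domain": "biology"}\n'
    '{"text": "A prime number has exactly two divisors.", "domain": null}\n'
)
seen, profile = profile_fields(sample)
for field, types in profile.items():
    print(field, types)
```

With a real shard, an open file handle can be passed in place of the in-memory sample. If the files turn out to be single JSON arrays instead of one object per line, the line-by-line parsing assumed here would need to change.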

Provenance

Source: codelion
Collection Method: synthetic
Freshness: Last updated March 8, 2026.

The dataset is provided in JSON format and is compatible with Polars, Dask, and the Hugging Face Datasets library for large-scale processing.
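At this scale, reading records lazily in batches is usually safer than loading everything into memory, which is also what the libraries named above do internally in their own ways. The sketch below illustrates the idea with the standard library only, so it makes no claims about those libraries' APIs. It assumes newline-delimited JSON, and `iter_batches` and the sample records are hypothetical.

```python
import io
import json
from itertools import islice
from typing import Iterable, Iterator


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[list]:
    """Yield lists of parsed records, at most `batch_size` records per list."""
    # Parse lazily so only one batch is held in memory at a time.
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical records standing in for a real shard.
sample = io.StringIO(
    "".join(json.dumps({"id": i, "text": f"entry {i}"}) + "\n" for i in range(5))
)
batches = list(iter_batches(sample, batch_size=2))
print([len(batch) for batch in batches])
```

Passing an open file handle instead of the in-memory sample keeps memory use proportional to the batch size, not to the size of the file.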

Quality Score

Overall: D39
Description: 39
Source: 39
Reputation: 50
Access: 22

Community

Downloads: 2.0K
Likes: 4
Views: 0

Dataset Info

Author: codelion
Created: Feb 25, 2026
Updated: Mar 8, 2026
Last synced: May 30, 2026
