Fineweb-Edu-Chinese V2.2: 10B-100B Tokens for Educational LLMs

Name: Fineweb-Edu-Chinese V2.2: 10B-100B Tokens for Educational LLMs
Creator: opencsg
Published: 2026-01-30T07:02:48
Keywords: task category: text generation, task category: question answering, language: zh, size category: 10B-100B, license: Apache 2.0, region: US, arXiv:2501.08197, arXiv:2305.15717, arXiv:2305.11206, education, SFT, synthetic, natural language processing

Description

OpenCSG published Fineweb-Edu-Chinese V2.2 on January 30, 2026 and last updated it on February 2, 2026. The corpus falls in the 10-billion-to-100-billion-token size category and consists of Chinese-language educational text. It supports both ends of Large Language Model development by including pre-training data as well as Supervised Fine-Tuning (SFT) instruction pairs.

Use Cases

Pre-training foundational models using the 10B-100B token educational text corpus
Supervised Fine-Tuning (SFT) for educational Question Answering using the instruction-tuning features
Generating synthetic educational content based on the synthetic data patterns provided
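
As a minimal sketch of the SFT use case above, the snippet below turns instruction pairs into prompt/completion strings. The field names `instruction` and `output` and the prompt template are assumptions made for illustration; this page does not document the dataset's schema, so check the actual column names before adapting it.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical field names: the dataset's real schema is not documented here.
INSTRUCTION_KEY = "instruction"
RESPONSE_KEY = "output"


def to_prompt_completion(record: Dict[str, str]) -> Tuple[str, str]:
    """Split one instruction pair into a (prompt, completion) tuple."""
    instruction = record[INSTRUCTION_KEY].strip()
    response = record[RESPONSE_KEY].strip()
    # Illustrative template only; use whatever format your trainer expects.
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt, response


def iter_sft_pairs(records: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield formatted pairs, skipping records with a missing or empty field."""
    for record in records:
        if not record.get(INSTRUCTION_KEY, "").strip():
            continue
        if not record.get(RESPONSE_KEY, "").strip():
            continue
        yield to_prompt_completion(record)


if __name__ == "__main__":
    sample = [
        {"instruction": "什么是光合作用？", "output": "光合作用是植物利用光能合成有机物的过程。"},
        {"instruction": "", "output": "skipped because the instruction is empty"},
    ]
    for prompt, completion in iter_sft_pairs(sample):
        print(prompt + completion)
```

Keeping the prompt and completion separate makes it straightforward to mask the prompt tokens from the loss during fine-tuning, if the training setup calls for that.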

Strengths

Massive scale of 10B to 100B tokens
Apache 2.0 license for open commercial and research use
Integrated support for both pre-training and SFT workflows

Limitations

Presence of synthetic data may introduce specific artifacts or hallucinations
Metadata indicates a US region tag which may imply geographic biases in source selection despite the Chinese language focus

Provenance

Source: OpenCSG Community
Collection Method: Scraped and synthetic
Freshness: Last updated February 2026.
Geography: China (language focus), US (metadata region)

Licensed under Apache 2.0. For technical details on the synthetic data generation and filtering process, consult arXiv:2501.08197.

Quality Score

Overall: C41
Description: 42
Source: 36
Reputation: 59
Access: 22

Community

Downloads: 39.6K
Likes: 71
Views: 0

Dataset Info

Author: opencsg
Created: Jan 30, 2026
Updated: Feb 2, 2026
Last synced: Jun 23, 2026

