OpenCSG released Fineweb-Edu-Chinese V2.2 in February 2026, a corpus of 10 billion to 100 billion tokens for the Chinese education sector. It includes both pre-training data and Supervised Fine-Tuning (SFT) instruction pairs, so it can be used for both stages of Large Language Model development.
The dataset is licensed under Apache 2.0. See arXiv:2501.08197 for technical details on the synthetic data generation and filtering process.