Available on 1 platform
A curated collection of 4.79 million Wikipedia articles spanning the 2008 and 2010 snapshot releases, cleaned and compressed for efficient large-scale language model pretraining. This dataset preserves the raw encyclopedic knowledge of two distinct eras of Wikipedia, making it valuable for temporal analysis, knowledge evolution research, and foundation model training. It was created by author 'adhyanshaa' and last updated on 2026-06-04.
The license is unknown; verify the licensing terms before use.