This multi-domain corpus contains 80,489,226 rows of raw English text and is designed for pre-training BERT-style models with Masked Language Modeling. Created by 8Opt, it aggregates text with metadata and labels stripped out, so that only the language itself remains. It was last updated on June 20, 2026.
The license is unknown, which may restrict commercial or research use.