Available on 1 platform
A demo dataset from OpenDCAI, dated 2026-01-05, that showcases a text processing pipeline. It contains raw and filtered web page data from Common Crawl; the filtered output file is 2.54 MB. Its purpose is to demonstrate a pipeline for cleaning and structuring text data for pretraining.
The license is unknown; verify the terms of use before using the dataset.