FineWeb-English-Filtered is a large-scale, cleaned, English-only text dataset derived from Common Crawl’s WET archives. It contains 940 million documents of publicly available web text, converted into Apache Parquet format with a consistent schema for efficient data loading. The dataset was generated by anandjh8 using a custom AWS Glue pipeline and was last updated on 2026-06-04.
The license is unknown; verify the terms of use before any commercial application.