Common Crawl English Filtered: 940 Million Web Documents

Creator: anandjh8
Published: 2025-10-06T17:34:22
Keywords: Web Text, Common Crawl, Text, NLP Training, English Language, Large Scale, Synthetic

Description

FineWeb-English-Filtered is a large-scale, cleaned, English-only text dataset derived from Common Crawl’s WET archives. It contains 940 million documents of publicly available web text, converted into Apache Parquet format with a consistent schema for fast and efficient data loading. The dataset was generated by anandjh8 using a custom AWS Glue pipeline and was last updated on 2026-06-04.

Use Cases

Training large language models on the 940 million English web documents.
Fine-tuning text generation models on the cleaned and filtered web text corpus.
Conducting linguistic analysis on a large-scale sample of public English web content.
Pre-training embeddings or other NLP components using structured web-derived text.

Strengths

Large scale: 940 million documents.
Cleaned and filtered to English-only text, according to the description.
Converted to Apache Parquet format for efficient data loading.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
An exact row count is not reported beyond the headline figure of 940 million documents, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.
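
Because the columns are undocumented, a first pass after download is usually to profile a small sample of rows. The following is a minimal sketch, not the creator's tooling: it assumes the rows have already been read into Python dicts, and the function name `profile_columns` and the column names `url` and `text` are hypothetical, since the actual schema is not documented.

```python
from collections import Counter, defaultdict
from itertools import islice


def profile_columns(rows, limit=1000):
    """Summarise up to `limit` row dicts: value types, null count, one sample."""
    types = defaultdict(Counter)  # column -> counts of Python type names
    nulls = Counter()             # column -> number of None values
    samples = {}                  # column -> first non-null value seen
    for row in islice(rows, limit):
        for column, value in row.items():
            if value is None:
                nulls[column] += 1
                types[column]  # register the column even if all values are null
                continue
            types[column][type(value).__name__] += 1
            samples.setdefault(column, value)
    return {
        column: {
            "types": dict(counts),
            "nulls": nulls[column],
            "sample": samples.get(column),
        }
        for column, counts in types.items()
    }


# Hypothetical rows; the real column names must be read from the files.
example_rows = [
    {"url": "https://example.com", "text": "hello"},
    {"url": None, "text": "world"},
]
for name, summary in profile_columns(example_rows).items():
    print(name, summary)
```

With the Parquet files on disk, rows could be supplied by a reader such as pyarrow, for example by iterating `pyarrow.parquet.ParquetFile(path).iter_batches()` and calling `to_pylist()` on each batch. Reading only the first batch keeps the inspection cheap relative to the full corpus.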

Provenance

Source: Common Crawl’s WET archives
Collection Method: Processed, filtered, and merged using a custom AWS Glue pipeline
Freshness: Last updated 2026-06-04 06:44:13; freshness should be verified.

License is unknown; terms of use should be verified before commercial application.

Quality Score

Overall: D39
Description: 42
Source: 36
Reputation: 46
Access: 26

Community

Downloads: 357
Likes: 2
Views: 0

Dataset Info

Author: anandjh8
Created: Oct 6, 2025
Updated: Jun 4, 2026
Last synced: Jun 11, 2026
