DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


Dataflow Demo: Filtered Web Crawl Text for Pretraining | DataSalon


Materials Science


Name: Dataflow Demo: Filtered Web Crawl Text for Pretraining
Creator: OpenDCAI
Published: 2025-06-16T08:02:52
Keywords: Text Filtering, Web Crawl, Data Processing, Text, Tabular, NLP Pipeline


Available on 1 platform

Description

A demo dataset from OpenDCAI, published 2025-06-16 and last updated 2026-01-05, that illustrates a text processing pipeline. It contains raw web page data from Common Crawl (206 MB) alongside the filtered output (2.54 MB). Its purpose is to demonstrate how text data can be cleaned and structured for pretraining.

Use Cases

Benchmarking text filtering algorithms by comparing their output on the raw file against the provided filtered file.
Developing web text cleaning pipelines, using the demonstrated Common Crawl processing as a reference.
Studying data quality metrics for pretraining corpora by measuring what the filtering pipeline keeps and discards.
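
As a rough illustration of the benchmarking use case, the sketch below applies a simple heuristic filter to a list of raw documents and reports how many survive. The function names, thresholds, and rules are hypothetical and chosen only for illustration; they are not the rules used by the DataFlow pipeline, which this listing does not describe.

```python
def keep_document(text: str, min_words: int = 20, max_symbol_ratio: float = 0.3) -> bool:
    """Hypothetical heuristic filter; NOT the DataFlow pipeline's actual rule."""
    words = text.split()
    if len(words) < min_words:
        return False
    # Share of characters that are neither alphanumeric nor whitespace.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def retention_rate(raw_docs: list[str], **filter_kwargs) -> float:
    """Fraction of raw documents that survive the filter."""
    if not raw_docs:
        return 0.0
    kept = [doc for doc in raw_docs if keep_document(doc, **filter_kwargs)]
    return len(kept) / len(raw_docs)
```

Running a candidate filter over the raw file and comparing its retention rate and kept documents with the provided filtered file would give one simple point of comparison.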

Strengths

Includes both the raw (206 MB) and filtered (2.54 MB) files, giving a clear before-and-after view of the data.
Provides a concrete example of processing Common Crawl data, a common web corpus source.
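
The two file sizes quoted above give a quick sense of how aggressive the filtering is. The short calculation below is only arithmetic on those sizes; it measures bytes, not documents, so it says nothing about how many pages were kept.

```python
raw_mb = 206.0       # raw file size from the listing
filtered_mb = 2.54   # filtered file size from the listing

retained_fraction = filtered_mb / raw_mb
reduction_factor = raw_mb / filtered_mb

print(f"retained by size: {retained_fraction:.2%}")  # prints "retained by size: 1.23%"
print(f"size reduction: {reduction_factor:.1f}x")    # prints "size reduction: 81.1x"
```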

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.
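
Because the row count and column semantics are not documented, a first inspection step after download might look like the sketch below. It assumes the file is in JSON Lines format (one JSON object per line), which this listing does not confirm; the function name and the file path in the usage note are hypothetical.

```python
import json
from collections import Counter


def summarize_jsonl(path: str) -> tuple[int, Counter]:
    """Count records and tally how often each top-level field appears.

    Assumes one JSON value per line (JSON Lines); adjust if the
    downloaded file turns out to use another format.
    """
    row_count = 0
    field_counts: Counter = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            row_count += 1
            if isinstance(record, dict):
                field_counts.update(record.keys())
    return row_count, field_counts
```

For example, `rows, fields = summarize_jsonl("filtered.jsonl")` would return the number of records and a tally of field names; a field that appears in fewer records than the row count is optional or inconsistently populated.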

Provenance

Source: Common Crawl
Collection Method: Web crawling and subsequent filtering/structuring via the DataFlow project pipeline.
Freshness: Last updated 2026-01-05 10:01:09; verify against the source platform before use.

License is unknown; terms of use must be verified before application.


Quality Score

C41

Community

30 downloads

2 likes

0 views

Dataset Info

Author: OpenDCAI
Created: Jun 16, 2025
Updated: Jan 5, 2026
Last synced: Apr 22, 2026
