Name: WCXB: Web Content Extraction Benchmark with 2,008 Pages Across 7 Types
Creator: murrough-foley
Published: 2026-03-29T15:11:15
Keywords: Web Content Extraction, Benchmark, Text, Natural Language Processing, Html Processing, Boilerplate Removal

Description

WCXB is the largest open benchmark for evaluating web content extraction, boilerplate removal, and main content detection. It provides 2,008 human-reviewed web pages spanning 7 page types and 1,613 domains, with ground truth annotations, HTML source files, and baseline results from 14 extraction systems. The dataset was created by murrough-foley and last updated on 2026-04-04.

Use Cases

Benchmarking web content extraction systems based on the provided ground truth annotations and baseline results
Training models for boilerplate removal based on the diverse HTML source files and page types
Evaluating main content detection algorithms across the 7 page types covered by the benchmark, including non-news pages
Researching the performance of extraction tools on a large, domain-diverse collection of web pages
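
For the benchmarking use case, a comparison between an extractor's output and a ground truth annotation could be sketched as below. This is a minimal sketch, assuming a word-level F1 score over whitespace tokens; the dataset description does not state which metric WCXB uses, and the function names and sample strings here are hypothetical.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def token_f1(extracted: str, reference: str) -> float:
    """Word-level F1 between extracted text and reference text.

    Assumption: this metric is illustrative only and is not confirmed
    to be the one used for the WCXB baseline results.
    """
    ext_counts = Counter(tokenize(extracted))
    ref_counts = Counter(tokenize(reference))
    # Multiset intersection: a shared token counts up to its minimum frequency.
    overlap = sum((ext_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample: the extractor kept all main content but also
# leaked three boilerplate tokens (navigation and footer).
reference = "Main article text about extraction"
extracted = "Home Login Main article text about extraction Footer"
score = token_f1(extracted, reference)
print(f"F1 = {score:.3f}")
```

In this sample recall is perfect while precision drops because of the leaked boilerplate, which is the kind of failure a boilerplate removal benchmark is meant to expose.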

Strengths

Contains 2,008 human-reviewed web pages, providing a substantial evaluation corpus
Spans 7 distinct page types and 1,613 domains, offering diversity beyond typical news-only benchmarks
Includes ground truth annotations, HTML source files, and baseline results from 14 extraction systems

Limitations

Column-level documentation is absent; field semantics must be inferred after download
Row count is not reported in the dataset metadata, so the stated figure of 2,008 pages cannot be confirmed before download
Last updated 2026-04-04 06:27:59; freshness should be verified

Provenance

Source: huggingface
Collection Method: Human-reviewed web pages with ground truth annotations.
Time Range: Not specified
Freshness: Last updated 2026-04-04 06:27:59.
Geography: Not specified

License is unknown and should be verified before use.
