Description
WCXB is the largest open benchmark for evaluating web content extraction, boilerplate removal, and main content detection. It provides 2,008 human-reviewed web pages spanning 7 page types and 1,613 domains, with ground truth annotations, HTML source files, and baseline results from 14 extraction systems. The dataset was created by murrough-foley and last updated on 2026-04-04.
Use Cases
Benchmarking web content extraction systems based on the provided ground truth annotations and baseline results
Training models for boilerplate removal based on the diverse HTML source files and page types
Evaluating main content detection algorithms across the dataset's 7 page types, including non-news pages
Researching the performance of extraction tools on a large, domain-diverse collection of web pages
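As an illustration of the benchmarking use case above, the following is a minimal sketch of scoring one extractor output against a ground truth annotation. It assumes a bag-of-words token F1 with lowercase whitespace tokenization, a common choice for extraction benchmarks; the source does not state which metric WCXB uses, and the function names `tokenize` and `token_f1` are hypothetical.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization; the dataset's own
    # evaluation may normalize text differently.
    return text.lower().split()


def token_f1(extracted: str, reference: str) -> float:
    """Bag-of-words F1 between extractor output and ground truth text."""
    ext = Counter(tokenize(extracted))
    ref = Counter(tokenize(reference))
    if not ext and not ref:
        return 1.0  # both empty: treat as a perfect match
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((ext & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Averaging `token_f1` over all pages, or separately per page type, would give a single comparable score per extraction system under this assumed metric.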
Strengths
Contains 2,008 human-reviewed web pages, providing a substantial evaluation corpus