Name: Mixed Text Dataset: 20,000 Samples from Wikipedia, Gutenberg, and CNN/DailyMail
Creator: brograrnmer
Published: 2026-05-05T07:57:43
Keywords: Human Written Text, Gutenberg, Computer Vision, Text, Wikipedia, Cnn Dailymail, Text Corpus

Description

This dataset contains 20,000 text samples compiled from three sources: Wikipedia, Project Gutenberg, and CNN/DailyMail. It was created by 'brograrnmer' and last updated on May 5, 2026. Preprocessing consisted of regex cleaning that replaced certain patterns with whitespace; the patterns themselves are not specified.
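
The exact patterns used in the cleaning step are not listed. As a minimal sketch of what regex cleaning that replaces matches with whitespace could look like, the example below uses hypothetical patterns (URLs, bracketed citation markers, and control characters). The pattern list, the function name `regex_clean`, and the final whitespace-collapsing step are all assumptions for illustration, not the author's confirmed method.

```python
import re

# Hypothetical patterns; the dataset's actual regex list is not documented.
ASSUMED_PATTERNS = [
    re.compile(r"https?://\S+"),     # URLs
    re.compile(r"\[\d+\]"),          # bracketed citation markers such as [12]
    re.compile(r"[\x00-\x1f\x7f]"),  # ASCII control characters, including newlines and tabs
]

def regex_clean(text: str) -> str:
    """Replace each assumed pattern with a single space, then collapse whitespace runs."""
    for pattern in ASSUMED_PATTERNS:
        text = pattern.sub(" ", text)
    # Collapsing runs of whitespace is an extra assumption; the description
    # only says that matched patterns were replaced with whitespace.
    return re.sub(r"\s+", " ", text).strip()

sample = "See https://example.org for details.[3]\nNext line."
print(regex_clean(sample))
```

Comparing cleaned samples against the original sources is the only reliable way to learn which patterns were actually removed.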

Use Cases

Train language models on diverse writing styles, drawing on the mix of encyclopedic, literary, and news text.
Benchmark text summarization models on the CNN/DailyMail news articles.
Study stylistic differences in human-written text across the three documented source domains.
Pre-train or fine-tune models for tasks requiring general domain knowledge from the Wikipedia subset.

Strengths

Contains 20,000 total text samples, providing a substantial corpus for training.
Explicitly documents the sample count from each of its three sources: 10,000 from Wikipedia, 3,000 from Gutenberg, and 7,000 from CNN/DailyMail.
Names the preprocessing method used (regex_cleaning).
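
The documented per-source counts can be checked against the stated total and turned into mixture proportions. The sketch below uses only the figures given above; the names `SOURCE_COUNTS` and `source_shares` are illustrative.

```python
# Per-source counts as stated in the dataset description.
SOURCE_COUNTS = {"Wikipedia": 10_000, "Gutenberg": 3_000, "CNN/DailyMail": 7_000}

def source_shares(counts):
    """Return the total sample count and each source's fraction of that total."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}

total, shares = source_shares(SOURCE_COUNTS)
print(total)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Half of the samples are encyclopedic text, so a model trained on the unweighted mixture will see Wikipedia more than three times as often as Gutenberg.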

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row counts are stated only in the description (20,000 in total); they should be verified against the downloaded files before assessing suitability for a given task.
The Wikipedia dump is from March 2022, so the information may not be current.
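
Because the fields are undocumented, a first step after download is to read the header and count the rows. The sketch below assumes the data ships as a CSV file, which is not confirmed; the column names `text` and `source` in the stand-in data are hypothetical, as is the helper name `summarize_columns`.

```python
import csv
import io

def summarize_columns(handle, preview_rows=3):
    """Return column names, row count, and the first few rows of a CSV file object."""
    reader = csv.DictReader(handle)
    preview, row_count = [], 0
    for row in reader:
        row_count += 1
        if len(preview) < preview_rows:
            preview.append(row)
    return reader.fieldnames, row_count, preview

# Stand-in data with hypothetical columns. For a real file, pass
# open(path_to_downloaded_csv, newline="", encoding="utf-8") instead.
fake_file = io.StringIO(
    "text,source\n"
    "An encyclopedic sentence.,wikipedia\n"
    "A news sentence.,cnn_dailymail\n"
)
columns, n_rows, preview = summarize_columns(fake_file)
print(columns, n_rows)
```

Long documents may exceed the `csv` module's default field size limit, which `csv.field_size_limit` can raise.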

Provenance

Source: Compiled from Wikipedia (20220301 dump), Project Gutenberg (via GutenDex API), and CNN/DailyMail (Version 3.0.0).
Collection Method: Aggregation from multiple open sources, followed by regex-based cleaning.
Time Range: The Wikipedia component is from a March 2022 snapshot. Temporal coverage for other sources is unspecified.
Freshness: Last updated 2026-05-05 08:12:08; freshness should be verified.

License information is unknown, which may restrict commercial or redistribution use.
