Pre-1900 Corpus: A Collection of Historical English Texts with Metadata

Name: Pre-1900 Corpus: A Collection of Historical English Texts with Metadata
Creator: mhla
Published: 2026-02-21T09:04:48
Keywords: Language Modeling, Historical Text, Pre 1900, Text, English Language, Natural Language Processing

Description

A cleaned corpus of English-language texts published before 1900, intended for training the GPT-1900 model. Each record includes the full document text, publication year, title, source identifier, and OCR quality scores. The dataset was created by 'mhla' and last updated on March 29, 2026.

Use Cases

Training historical language models based on pre-1900 English text.
Analyzing linguistic change over time based on publication year metadata.
Filtering documents for quality based on OCR confidence and legibility scores.
Studying text source distribution based on the source dataset identifier.
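
The year-based and source-based analyses above can be sketched in a few lines. This is a minimal illustration only: the column names are not documented on this page, so the field names `year` and `source`, the sample records, and the helper functions below are all hypothetical.

```python
from collections import Counter

# Hypothetical records. The field names "title", "year" and "source" are
# assumptions for illustration, not the dataset's documented column names.
records = [
    {"title": "A", "year": 1812, "source": "src_a"},
    {"title": "B", "year": 1818, "source": "src_b"},
    {"title": "C", "year": 1895, "source": "src_a"},
]


def decade_of(year: int) -> int:
    """Return the first year of the decade that contains `year`."""
    return (year // 10) * 10


def count_by_decade(rows) -> Counter:
    """Count documents per publication decade."""
    return Counter(decade_of(row["year"]) for row in rows)


def count_by_source(rows) -> Counter:
    """Count documents per source identifier."""
    return Counter(row["source"] for row in rows)


by_decade = count_by_decade(records)
by_source = count_by_source(records)
```

Grouping by decade rather than by single year is just one possible choice; it keeps the counts stable when individual years are sparsely populated.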

Strengths

All documents have a confirmed publication year before 1900.
Includes metadata fields for title, source, and OCR quality assessment.
Specifically designed as a cleaned corpus for model training.

Limitations

Row count, file formats, and license information are unknown.
Column-level documentation is incomplete; the full description requires visiting the dataset page.
The dataset's last update date is in the future (2026-03-29), which may indicate a data entry error.

Provenance

Source: huggingface
Collection Method: Cleaned collection from unspecified source datasets.
Time Range: Pre-1900
Freshness: Last updated 2026-03-29 21:32:08; freshness should be verified.
Geography: Not specified

The OCR score is set to -1.0 when unavailable, so a minimum-score filter should treat -1.0 as missing rather than as a low score. The future timestamp for 'last updated' is also unusual.
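
One way to handle the -1.0 value during quality filtering is sketched below. Only the -1.0 sentinel comes from the description above; the field name `ocr_score`, the 0-to-1 score scale, the threshold, and the helper names are assumptions made for illustration.

```python
from typing import Optional

OCR_MISSING = -1.0  # sentinel for "score unavailable", as stated above


def ocr_or_none(score: float) -> Optional[float]:
    """Map the -1.0 sentinel to None so it is never compared as a real score."""
    return None if score == OCR_MISSING else score


def filter_by_ocr(rows, min_score: float, keep_missing: bool = False):
    """Keep rows whose OCR score is at least `min_score`.

    Rows with a missing score are kept only when `keep_missing` is True.
    The field name "ocr_score" is hypothetical.
    """
    kept = []
    for row in rows:
        score = ocr_or_none(row["ocr_score"])
        if score is None:
            if keep_missing:
                kept.append(row)
        elif score >= min_score:
            kept.append(row)
    return kept


# Hypothetical sample rows, assuming scores on a 0-to-1 scale.
sample = [
    {"title": "A", "ocr_score": 0.95},
    {"title": "B", "ocr_score": 0.40},
    {"title": "C", "ocr_score": -1.0},
]

strict = filter_by_ocr(sample, min_score=0.8)
lenient = filter_by_ocr(sample, min_score=0.8, keep_missing=True)
```

Making the treatment of missing scores an explicit flag avoids silently dropping every unscored document, which a plain `score >= threshold` comparison would do.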

Related Topics

Text, Language Modeling, Historical Text, Pre 1900, English Language, Natural Language Processing

Quality Score

Overall: D38
Description: 42
Source: 36
Reputation: 42
Access: 26

Community

129 downloads
1 like
0 views

Dataset Info

Author: mhla
Created: Feb 21, 2026
Updated: Mar 29, 2026
Last synced: May 7, 2026
