Available on 1 platform
A cleaned corpus of English-language texts published before 1900, intended for training the GPT-1900 model. Each record includes the full document text, publication year, title, source identifier, and an OCR quality score. The dataset was created by author 'mhla' and last updated on March 29, 2026.
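The OCR score is set to -1.0 when unavailable. Because this sentinel falls below any reasonable quality cutoff, a minimum-score filter will drop unscored documents unless it handles -1.0 explicitly. The 'last updated' timestamp lies in the future, which is unusual.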
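The sentinel handling described above can be sketched as follows. This is a minimal illustration, not the dataset's own tooling: the field name `ocr_score`, the 0.8 threshold, and the `keep_unscored` flag are all assumptions, since the listing does not give the actual column names or a recommended cutoff.

```python
from typing import Dict, Iterable, List

# Sentinel value the listing describes for a missing OCR score.
OCR_UNAVAILABLE = -1.0


def keep_record(record: Dict, min_ocr: float = 0.8, keep_unscored: bool = True) -> bool:
    """Decide whether a record passes the OCR quality filter.

    `ocr_score` is a hypothetical field name; adjust it to the real schema.
    """
    score = record.get("ocr_score", OCR_UNAVAILABLE)
    if score == OCR_UNAVAILABLE:
        # Unscored is not the same as low quality, so make the choice explicit.
        return keep_unscored
    return score >= min_ocr


def filter_records(records: Iterable[Dict], min_ocr: float = 0.8,
                   keep_unscored: bool = True) -> List[Dict]:
    """Return the records that pass `keep_record`, preserving order."""
    return [r for r in records if keep_record(r, min_ocr, keep_unscored)]


# Hypothetical sample rows for illustration only.
sample = [
    {"title": "A", "ocr_score": 0.95},
    {"title": "B", "ocr_score": 0.40},
    {"title": "C", "ocr_score": -1.0},
]
kept_titles = [r["title"] for r in filter_records(sample)]
strict_titles = [r["title"] for r in filter_records(sample, keep_unscored=False)]
```

Treating unscored documents as a separate case, rather than letting them fall through a numeric comparison, keeps the decision to include or exclude them visible in the filtering code.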