Anadolu OCR Corpus: Historical Ottoman Turkish and Arabic Text Pages

Name: Anadolu OCR Corpus: Historical Ottoman Turkish and Arabic Text Pages
Creator: fatihburakkaragoz
Published: 2026-05-12T17:09:40
Keywords: Historical Text, OCR Output, Document Corpus, Text, Tabular, Natural Language Processing, Ottoman Turkish, Arabic Script

Description

Anadolu OCR Corpus is an OpenCR export of OCR text and metadata for 52 historical PDF sources in Ottoman Turkish, Turkish, and Arabic. The dataset is provided in two Hugging Face configs: 'pages' and 'documents'. It was authored by fatihburakkaragoz and last updated on 2026-05-12.
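
A minimal sketch of how the two configs might be loaded with the Hugging Face `datasets` library. The repository ID is not given on this page, so `repo_id` is a placeholder to be filled in from the source link, and the helper name `load_config` is hypothetical.

```python
CONFIGS = ("pages", "documents")  # config names as stated in the description


def load_config(repo_id: str, config: str):
    """Load one config of the corpus from the Hugging Face Hub.

    repo_id is an assumption: this page does not state the repository ID.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, config)
```

Calling `load_config("<owner>/<dataset-name>", "pages")` with the real repository ID should return the page-level rows; the available splits are not documented here and would need to be checked after loading.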

Use Cases

Train or evaluate OCR models on historical document text.
Analyze language distribution and script direction using page-level metadata.
Perform document-level text analysis using concatenated text and markdown fields.
Study historical text patterns across Ottoman Turkish, Turkish, and Arabic sources.
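
The language and script-direction analysis above could be sketched as a simple tally over page-level rows. The column names `language` and `direction` are assumptions, since the page does not document the schema, and the sample rows are made up for illustration.

```python
from collections import Counter


def language_direction_counts(pages, lang_key="language", dir_key="direction"):
    """Count pages per (language, script direction) pair.

    lang_key and dir_key are hypothetical column names; missing or empty
    values are counted under "unknown".
    """
    counts = Counter()
    for row in pages:
        lang = row.get(lang_key) or "unknown"
        direction = row.get(dir_key) or "unknown"
        counts[(lang, direction)] += 1
    return counts


# Made-up rows for illustration only; not taken from the corpus.
sample_pages = [
    {"language": "ota", "direction": "rtl"},
    {"language": "ar", "direction": "rtl"},
    {"language": "tr", "direction": "ltr"},
    {"language": "ota", "direction": "rtl"},
    {"language": None, "direction": "rtl"},
]
counts = language_direction_counts(sample_pages)
```

The same function should work on rows iterated from the 'pages' config once the actual column names are substituted.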

Strengths

Contains OCR text and metadata for 52 distinct historical PDF sources.
Provides structured data in two configs: page-level and document-level rows.
Includes validation status, language detection, and script direction metadata.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes it harder to judge whether the dataset is large enough for a given task.
Data may reflect geographic, temporal, or source bias inherent to the 52 selected PDFs.
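
Because column documentation and row counts are missing, a first step after download might be a quick schema summary. The sketch below assumes rows arrive as dictionaries; the helper `summarize_columns` and the sample column names are hypothetical.

```python
def summarize_columns(rows):
    """Return {column: {"types": sorted type names, "non_null": count}}."""
    summary = {}
    for row in rows:
        for name, value in row.items():
            info = summary.setdefault(name, {"types": set(), "non_null": 0})
            if value is not None:
                info["types"].add(type(value).__name__)
                info["non_null"] += 1
    return {
        name: {"types": sorted(info["types"]), "non_null": info["non_null"]}
        for name, info in summary.items()
    }


# Made-up rows for illustration only; real column names must be inspected.
sample_rows = [
    {"page": 1, "text": "example", "validated": True},
    {"page": 2, "text": None, "validated": False},
]
row_count = len(sample_rows)
schema = summarize_columns(sample_rows)
```

Columns with few non-null values or mixed types are good candidates for closer manual inspection before any training or analysis.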

Provenance

Source: Hugging Face
Collection Method: OpenCR export of OCR text and metadata from PDF sources.
Freshness: Last updated 2026-05-12 17:14:52; confirm against the source before relying on this date.

License: Unknown; verify the terms of use before using the data.

Quality Score

Overall: D39
Description: 42
Source: 39
Reputation: 39
Access: 26

Community

19 downloads
1 like
0 views

Dataset Info

Author: fatihburakkaragoz
Created: May 12, 2026
Updated: May 12, 2026
Last synced: May 21, 2026
