Name: Nom Ocr Data: Historical Hán-Nôm Manuscript Pages for OCR Training
Creator: Aerbote88
Published: 2026-04-30T21:46:14
Keywords: Image, Han Nom, Multimodal Annotation, Optical Character Recognition, Computer Vision, Historical Manuscripts, Natural Language Processing, Multimodal

Description

An ongoing corpus of historical Hán-Nôm manuscript pages annotated for OCR training. Each page includes a high-resolution image, per-character bounding boxes with corrected text labels, and candidate alternates from two upstream OCR engines. The dataset is maintained by Aerbote88 and was last updated on May 11, 2026.

Use Cases

Train OCR models for historical scripts using the per-character bounding boxes and corrected text labels as ground truth.
Compare OCR engine performance using candidate alternates from Kandianguji and Nôm Na Việt engines.
Analyze document structure based on reading-order column polygons for text, binding, marginalia, and commentary.
Study character-level uncertainty and metadata to improve OCR confidence scoring.
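
As a rough illustration of the engine-comparison use case, the sketch below scores each upstream engine against the corrected label and lists the characters on which the engines disagree. The record layout is an assumption: the keys `label`, `candidates`, `kandianguji`, and `nom_na_viet` are hypothetical names, and the sample characters are made up for the example rather than taken from the dataset. Adapt the field names once the real schema has been inspected.

```python
from collections import Counter

# Made-up sample records with a HYPOTHETICAL layout; the dataset's real
# field names are undocumented and will likely differ.
SAMPLE_CHARS = [
    {"label": "南", "candidates": {"kandianguji": "南", "nom_na_viet": "南"}},
    {"label": "國", "candidates": {"kandianguji": "国", "nom_na_viet": "國"}},
    {"label": "山", "candidates": {"kandianguji": "山", "nom_na_viet": "出"}},
    {"label": "河", "candidates": {"kandianguji": "河", "nom_na_viet": "河"}},
]


def engine_accuracy(chars):
    """Return {engine: fraction of characters matching the corrected label}."""
    hits = Counter()
    totals = Counter()
    for ch in chars:
        for engine, guess in ch["candidates"].items():
            totals[engine] += 1
            if guess == ch["label"]:
                hits[engine] += 1
    return {engine: hits[engine] / totals[engine] for engine in totals}


def disagreements(chars):
    """Return the characters whose engine candidates differ from each other."""
    return [ch for ch in chars if len(set(ch["candidates"].values())) > 1]


if __name__ == "__main__":
    print(engine_accuracy(SAMPLE_CHARS))
    print([ch["label"] for ch in disagreements(SAMPLE_CHARS)])
```

Characters where the engines disagree are a natural starting point for the uncertainty analysis mentioned above, since at least one candidate there must differ from the corrected label.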

Strengths

Includes per-character bounding boxes with corrected text labels, providing precise ground truth.
Contains candidate alternates from two distinct upstream OCR engines for comparison.
Features reading-order column polygons that segment text, binding, marginalia, and commentary.
Maintained as an ongoing project with frequent additions; most recently updated in May 2026.

Limitations

Row count, file formats, and total dataset size are not stated, which makes it hard to assess suitability before download.
Column-level documentation is absent; field semantics must be inferred after download.
No license is stated, so reuse and redistribution rights are unclear.
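
Because field semantics have to be inferred after download, a small schema summary can be a useful first step. The sketch below assumes the annotations can be parsed as JSON, which the listing does not confirm; the function name `summarize_fields` and the sample record are hypothetical. It walks a parsed object and reports every key path together with the value types seen there.

```python
import json


def summarize_fields(node, prefix=""):
    """Collect {dotted_path: set of type names} for every key in nested JSON."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            found.setdefault(path, set()).add(type(value).__name__)
            # Recurse so nested objects and lists are summarized too.
            for sub_path, types in summarize_fields(value, path).items():
                found.setdefault(sub_path, set()).update(types)
    elif isinstance(node, list):
        # "[]" marks "any element of this list" in the reported path.
        for item in node:
            for sub_path, types in summarize_fields(item, prefix + "[]").items():
                found.setdefault(sub_path, set()).update(types)
    return found


if __name__ == "__main__":
    # Hypothetical record; replace with json.load() on a downloaded file.
    sample = json.loads(
        '{"page": "p1", "chars": [{"bbox": [1, 2, 3, 4], "text": "南"}]}'
    )
    for path, types in sorted(summarize_fields(sample).items()):
        print(path, sorted(types))
```

Running this over a handful of downloaded records should show which paths hold bounding boxes, labels, and engine candidates, and whether any field changes type between pages.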

Provenance

Source: Historical Hán-Nôm manuscripts.
Collection Method: Annotated pages with per-character bounding boxes and text labels; includes outputs from Kandianguji and Nôm Na Việt OCR engines.
Freshness: Last updated 2026-05-11 00:04:57; described as an ongoing project updated frequently.

