Name: NayanaOCR Corpus 2025: 1M-Page Multilingual Synthetic Document Dataset
Creator: Cognitive-Lab
Published: 2026-05-20T16:31:11
Keywords: Document Understanding, Optical Character Recognition, Computer Vision, Multilingual, Natural Language Processing, Synthetic Data, Visual Question Answering, Synthetic, Multimodal

Description

Approximately 1 million pages of fully parallel synthetic documents rendered in 22 languages for OCR, layout detection, and visual question answering tasks. The dataset was created by Cognitive-Lab and was last updated on the Hugging Face platform in May 2026. It is described as one of the largest open-source multilingual, multi-task document datasets, with the same ~45,700 source pages rendered in every language (about 45,700 × 22 ≈ 1 million pages in total).
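The headline page count follows from the two figures in the description. A minimal arithmetic check, assuming the approximate source-page count of 45,700 is taken at face value (the variable names are illustrative only):

```python
# Figures taken from the dataset description; the source-page count is approximate.
SOURCE_PAGES = 45_700
LANGUAGES = 22

# Every source page is rendered once per language.
total_pages = SOURCE_PAGES * LANGUAGES

print(f"{total_pages:,} rendered pages")  # 1,005,400 rendered pages
```

The result is slightly above 1 million, which is consistent with the rounded "1M-Page" figure in the dataset name.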

Use Cases

Training optical character recognition (OCR) models on the roughly 1 million synthetic document pages.
Evaluating layout detection algorithms using the parallel document structure across 22 languages.
Developing visual question answering (VQA) systems for documents using the multi-task corpus.
Benchmarking multilingual document understanding models across the 22 languages, including any low-resource languages among them.

Strengths

Large scale with 1 million document pages.
True parallel corpus structure across 22 languages.
Designed for multiple tasks: OCR, layout detection, and VQA.
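Because the corpus is described as fully parallel, one sanity check after download is to confirm that every source page appears in every language. The sketch below is only illustrative: the field names `source_id` and `language` are hypothetical, since the card does not document the actual columns, and the sample rows are made up.

```python
from collections import defaultdict

def find_incomplete_pages(rows, expected_languages):
    """Return {source_id: sorted missing languages} for pages lacking a rendering."""
    languages_by_page = defaultdict(set)
    for row in rows:
        # "source_id" and "language" are assumed field names, not documented ones.
        languages_by_page[row["source_id"]].add(row["language"])
    return {
        page_id: sorted(expected_languages - languages)
        for page_id, languages in languages_by_page.items()
        if expected_languages - languages
    }

# Made-up rows: page 1 exists in both languages, page 2 only in English.
rows = [
    {"source_id": 1, "language": "en"},
    {"source_id": 1, "language": "hi"},
    {"source_id": 2, "language": "en"},
]
missing = find_incomplete_pages(rows, {"en", "hi"})
print(missing)  # {2: ['hi']}
```

An empty result would support the "true parallel" claim for the rows checked; with the real data, the expected set would contain all 22 language codes.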

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Exact row count and file formats are not documented, which may limit suitability assessment before download.
Data is synthetic, which may not fully capture the characteristics of real-world documents.
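Since column-level documentation is absent, a first step after download is to summarize which fields exist and what value types they hold. The following is a minimal sketch under stated assumptions: the sample records and their field names (`image`, `language`, `text`) are hypothetical placeholders, not the dataset's documented schema.

```python
from collections import defaultdict

def summarize_fields(records):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}

# Hypothetical sample records; the real field names are not documented.
sample = [
    {"image": b"\x89PNG", "language": "hi", "text": "..."},
    {"image": b"\x89PNG", "language": "ta", "text": None},
]
summary = summarize_fields(sample)
print(summary)
```

Fields that show more than one type (for example a string field that is sometimes `None`) are good candidates for closer inspection before training.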

Provenance

Source: Cognitive-Lab on Hugging Face.
Collection Method: Synthetic generation; the description states it is a 'synthetic OCR + VQA corpus'.
Freshness: Last updated 2026-05-25 16:45:43; freshness should be verified.

License: Unknown; terms of use must be verified before the data is used.
