Description

Indian multilingual document images and OCR transcriptions curated by MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE. This representative subset contains samples spanning 19 Indian languages and scripts, focusing on real-world documents with complex layouts and noisy scans. The full dataset, covering all 22 official languages, is scheduled for release upon paper acceptance.

Use Cases

Benchmarking OCR accuracy on real-world multilingual Indian documents
Training Document-VLM systems on aligned image-text pairs
Studying mixed-script OCR on documents that contain multiple languages per page
Developing language identification models, using file names that encode which languages are present
Analyzing OCR performance on noisy scans and complex layouts
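
For the benchmarking use case, a common metric is character error rate (CER). The sketch below is one minimal way to compute it and is not taken from the dataset authors; the function names `edit_distance` and `cer` are illustrative. It assumes both strings should be NFC-normalized first and counts errors over Unicode code points rather than grapheme clusters, which matters for Indic scripts where one visible syllable can span several code points.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # Assumption: NFC normalization is applied to both sides before comparison.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        # Convention chosen here: empty reference scores 0.0 only if the hypothesis is empty too.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

Averaging `cer` over all image-transcription pairs gives a corpus-level score; whether to average per document or pool all characters is a design choice that should be stated alongside any reported number.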

Strengths

Focuses on 19 Indian languages and scripts, including Assamese, Bengali, Hindi, Tamil, and Urdu
Contains real-world document images with corresponding OCR transcriptions
Sourced through archival and institutional collaborations by legally compliant means
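
Because the subset spans many scripts, a quick per-script character count can help when checking a transcription for mixed-script content. The following is a rough sketch, not part of the dataset tooling: the function name `script_counts` is illustrative, and it uses the first word of each Unicode character name as a stand-in for the script. This identifies scripts, not languages: Hindi and Marathi both count as DEVANAGARI, Assamese and Bengali both as BENGALI, and Urdu as ARABIC.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters and combining marks per script in a transcription."""
    counts = Counter()
    for ch in text:
        # Keep only letters (L*) and marks (M*); digits, spaces and punctuation are skipped.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name:
            # Heuristic: the first word of the character name, e.g. "DEVANAGARI" or "LATIN".
            counts[name.split()[0]] += 1
    return counts
```

A transcription whose counts are spread over two or more scripts is a candidate mixed-script page.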

Limitations

Row count is not reported, which makes it hard to assess suitability before download
Column-level documentation is absent; field semantics must be inferred after download
This is a representative subset; the complete dataset with all document images and metadata is not yet released
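
Since field semantics have to be inferred after download, a first step is to list which fields appear and what types they hold. The sketch below assumes the records have already been loaded as a list of Python dictionaries by whatever loader suits the downloaded files; the function name `summarize_fields` is illustrative, and the field names in the usage note are hypothetical, not the dataset's actual schema.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted list of Python type names seen in it."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}
```

For example, calling it on `[{"image": b"...", "text": "..."}]` (hypothetical fields) would report one bytes field and one string field. A field that shows more than one type, or `NoneType`, is worth inspecting by hand.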

Provenance

Source: MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE
Collection Method: Curated from authentic sources collected through archival and institutional collaborations
Freshness: Last updated 2026-05-07 06:18:52 (time zone not stated); verify against the source before relying on it
Geography: India-centric

Multimodal Multilingual Benchmark Document Images Computer Vision OCR Synthetic

ISOB-Small-Hard: Indian Scripts OCR Benchmark Sample
