Indian multilingual document images and OCR transcriptions curated by MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE. This representative subset contains samples spanning 19 Indian languages and scripts, focusing on real-world documents with complex layouts and noisy scans. The full dataset, covering all 22 official languages, is scheduled for release upon paper acceptance.
Use Cases
- Benchmarking OCR accuracy on real-world multilingual Indian documents
- Training Document-VLM systems on aligned image-text pairs
- Studying mixed-script OCR on documents that contain multiple languages per page
- Developing language identification models, using the language information encoded in file names
- Analyzing OCR performance on noisy scans and complex layouts
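For the benchmarking use case, one common metric is the character error rate (CER): the edit distance between a reference transcription and an OCR output, divided by the reference length. The following is a minimal sketch, not a metric prescribed by the dataset authors. The function names are illustrative, and the choice to apply NFC normalization and to count Unicode code points (rather than grapheme clusters, which may suit Indic scripts better) is an assumption.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) over code points."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, after NFC normalization.

    Assumption: both strings are compared as sequences of Unicode code
    points, so a dropped vowel sign counts as one error.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        # Empty reference: any output is treated as fully wrong.
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # The hypothesis drops the final vowel sign of the reference.
    print(character_error_rate("नमस्ते", "नमस्त"))
```

When comparing systems across scripts, it may be worth reporting CER per language, since code point counts per visible character differ between scripts.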
Strengths
- Focuses on 19 Indian languages and scripts, including Assamese, Bengali, Hindi, Tamil, and Urdu
- Contains real-world document images with corresponding OCR transcriptions
- Sourced from authentic archival and institutional collaborations through legally compliant means
Limitations
- Row count is not documented, which makes it harder to judge whether the subset is large enough for a given task
- Column-level documentation is absent; field semantics must be inferred after download
- This is a representative subset; the complete dataset with all document images and metadata is not yet released
Provenance
- Source: MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE
- Collection Method: Curated from authentic sources collected through archival and institutional collaborations
- Freshness: Last updated 2026-05-07 06:18:52; freshness should be verified
- Geography: India-centric