The India-Centric Image–Text Pairs Dataset is a multilingual collection of document images paired with OCR transcriptions. It covers 22 Indian languages, including Bengali, Hindi, Kannada, Malayalam, Marathi, Sanskrit, Tamil, and Telugu. The dataset was created by MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE and was last updated on 2026-05-07.
Use Cases
- Training multilingual OCR models on real-world document samples with mixed scripts and diverse fonts
- Fine-tuning Document-VLMs for document understanding using aligned image-text pairs
- Evaluating text recognition models on samples with scanning artifacts and layout variations
- Developing models for multilingual document comprehension using language-labeled entries
- Joint layout-text understanding tasks using full-page or cropped document images
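For the evaluation use case above, a common metric is character error rate (CER): the edit distance between a model's output and the paired transcription, divided by the transcription length. The following is a minimal sketch, not part of the dataset's tooling; the function names are illustrative. It assumes both strings are NFC-normalized first and counts Unicode code points rather than grapheme clusters, which matters for Indic scripts where one visible character can span several code points.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, measured in code points."""
    # Assumption: NFC normalization is an adequate canonical form here.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Here `reference` would be the dataset's OCR transcription and `hypothesis` the output of the model under evaluation. Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.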
Strengths
- Includes samples from 22 Indian languages, covering diverse scripts
- Pairs document images with clean machine-readable OCR text
- Contains metadata such as language, document type, font style, and data source details
Limitations
- The row count is not stated, which makes it harder to assess suitability before download
- Column-level documentation is absent; field semantics must be inferred after download
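Because the field semantics have to be inferred, a first step after download is usually to summarize which fields appear, what types they hold, and how often they are missing. The sketch below assumes the records have already been loaded as a list of dictionaries. The field names in `sample` (`image_path`, `text`, `language`, `font_style`) are hypothetical placeholders, not the dataset's documented columns.

```python
def summarize_fields(records):
    """Summarize each field's observed types, missing count, and one example value."""
    fields = sorted({key for record in records for key in record})
    summary = {}
    for field in fields:
        values = [record.get(field) for record in records]
        present = [v for v in values if v is not None]
        summary[field] = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),  # absent keys and None values
            "example": present[0] if present else None,
        }
    return summary


# Hypothetical records for illustration only; real field names may differ.
sample = [
    {"image_path": "page_001.png", "text": "नमस्ते", "language": "Hindi"},
    {"image_path": "page_002.png", "text": "வணக்கம்", "language": "Tamil",
     "font_style": None},
]

for name, info in summarize_fields(sample).items():
    print(name, info)
```

A field with several observed types or a high missing count is a signal to inspect those rows by hand before relying on that field for filtering or training.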
Provenance
- Source
  - MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE
- Collection Method
  - Likely collected from real-world document samples
- Freshness
  - Last updated 2026-05-07 09:14:39; freshness should be verified
- Geography
  - India