India-Centric Image–Text Pairs for OCR and Document-VLM Research

Name: India-Centric Image–Text Pairs for OCR and Document-VLM Research
Creator: Anonymous, Anonymous
Published: 2026-05-07T06:21:02
Keywords: Image, Vision Language, Benchmark, Document Images, Computer Vision, Text, Multilingual, OCR

by Anonymous, Anonymous / MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE
Updated 2mo ago

Available on 1 platform

Description

The India-Centric Image–Text Pairs dataset is a multilingual collection of document images paired with OCR transcriptions. It covers 22 Indian languages, including Bengali, Hindi, Kannada, Malayalam, Marathi, Sanskrit, Tamil, and Telugu. It was created by MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE and last updated on 2026-05-07.

Use Cases

Training multilingual OCR models on real-world document samples with mixed scripts and diverse fonts
Fine-tuning Document-VLMs for document understanding on aligned image-text pairs
Evaluating text recognition models on samples with scanning artifacts and layout variations
Developing multilingual document comprehension models using language-labeled entries
Joint layout-text understanding on full-page or cropped document images
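
For the text recognition evaluation use case, a common metric is character error rate (CER): the edit distance between the reference transcription and the model output, divided by the reference length. The sketch below is a minimal illustration and is not taken from the dataset's documentation. It assumes both strings are normalized to Unicode NFC and that errors are counted per code point, which means a dropped vowel sign in an Indic script counts as one error. Counting per grapheme cluster instead is an equally reasonable choice and will give different numbers.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion from ref
                            curr[j - 1] + 1,     # insertion into ref
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / reference length, after NFC normalization (an assumption)."""
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        # Convention chosen here: empty reference scores 0 only if the output is empty too.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

When reporting CER on this dataset, it is worth stating the normalization form and the counting unit alongside the score, since both affect comparability across languages and scripts.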

Strengths

Includes samples from 22 Indian languages, covering diverse scripts
Pairs document images with clean machine-readable OCR text
Contains metadata such as language, document type, font style, and data source details

Limitations

Row count is not reported, which makes it hard to assess whether the dataset suits a given task
Column-level documentation is absent; field semantics must be inferred after download
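
Because neither the row count nor the column semantics are documented, a first pass after download might simply count the rows and profile each field. The sketch below assumes the data has already been loaded as a list of dictionaries, one per row. The field names in the usage example (`image_path`, `text`, `language`) are hypothetical placeholders, not confirmed column names from this dataset.

```python
from collections import defaultdict


def summarize_fields(records):
    """Profile each field: observed value types, non-null count, and one example value."""
    summary = defaultdict(lambda: {"types": set(), "non_null": 0, "example": None})
    for rec in records:
        for key, value in rec.items():
            info = summary[key]  # registers the field even if every value is None
            if value is not None:
                info["types"].add(type(value).__name__)
                info["non_null"] += 1
                if info["example"] is None:
                    info["example"] = value
    return dict(summary)


# Hypothetical rows for illustration only; real field names must be checked after download.
sample_rows = [
    {"image_path": "page_001.png", "text": "नमस्ते", "language": "Hindi"},
    {"image_path": "page_002.png", "text": "வணக்கம்", "language": None},
]

row_count = len(sample_rows)
profile = summarize_fields(sample_rows)
```

Comparing each field's `non_null` count against `row_count` gives a quick view of which metadata fields are consistently populated.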

Provenance

Source: MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE
Collection Method: Likely collected from real-world document samples
Freshness: Last updated 2026-05-07 09:14:39; freshness should be verified
Geography: India

Quality Score

Overall: C41
Description: 48
Source: 41
Reputation: 35
Access: 31

Community

0 views

Dataset Info

Author: Anonymous, Anonymous
Org: MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE
Created: May 7, 2026
Updated: May 7, 2026
Last synced: May 19, 2026
