The dataset contains 28,408 images from Open Images paired with 142,040 captions that require models to read and reason about text within the visual scene. This version is formatted for the lmms-eval pipeline to support standardized benchmarking of large multi-modality models.
Use Cases
- Benchmark the text-recognition accuracy of large multi-modality models using the provided image and caption fields
- Train image-to-text models to synthesize visual features and OCR data into coherent descriptions
- Measure model performance on reading comprehension within visual contexts via the lmms-eval pipeline
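As a rough illustration of the benchmarking use case, the sketch below scores one model-generated caption against several reference captions for the same image. It uses a simple token-overlap F1 as a stand-in metric; it is not the metric lmms-eval computes, and the function names and sample captions are hypothetical, invented for this example only.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two captions (lowercased, whitespace-split).

    A simple stand-in metric, not the one used by lmms-eval.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both captions.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_reference_score(prediction: str, references: list[str]) -> float:
    """Score a prediction against its closest reference caption."""
    return max((token_f1(prediction, ref) for ref in references), default=0.0)


# Hypothetical captions for a single image (not taken from the dataset).
references = [
    "a red stop sign on a street corner",
    "a sign that says stop",
]
score = best_reference_score("a red sign that says stop", references)
```

Taking the best score over all references is one common way to handle multiple captions per image; averaging over references is an equally reasonable choice.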
Strengths
- 142,040 human-annotated captions across 28,408 unique images
- Integration with the lmms-eval framework for automated benchmarking of multi-modality models
- Focus on images containing legible text, such as signs, labels, and documents
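The two totals quoted above imply an average of five captions per image. The short check below works that out; it derives only an average from the stated counts and does not confirm that every image has exactly five captions.

```python
# Totals as stated in the dataset description.
NUM_IMAGES = 28_408
NUM_CAPTIONS = 142_040

# divmod returns the whole quotient and the remainder in one step.
captions_per_image, remainder = divmod(NUM_CAPTIONS, NUM_IMAGES)

print(f"Average captions per image: {NUM_CAPTIONS / NUM_IMAGES}")
print(f"Whole captions per image: {captions_per_image}, remainder: {remainder}")
```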