This dataset contains 25,000,000 image-caption pairs structured for large-scale multimodal model training. It expands on the 4M Img Caps framework, providing a larger volume of image-text pairs for vision-language tasks.
Use Cases
- Train zero-shot image classifiers using the image-text alignment data
- Develop automated image captioning systems by mapping visual inputs to the provided text strings
- Benchmark text-to-image retrieval performance across 25 million potential candidates
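The retrieval use case is typically scored with Recall@K: the fraction of captions whose paired image appears among the K highest-scoring candidates. The following is a minimal sketch in plain Python, not part of the dataset tooling. It assumes that embeddings have already been computed by some model and that the caption at index `i` is paired with the image at index `i`. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(text_embs, image_embs, k):
    """Fraction of captions whose paired image (same index) ranks in the top k."""
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, image) for image in image_embs]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(image_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy embeddings (assumed values, for illustration only).
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]

print(recall_at_k(text_embs, image_embs, k=1))
print(recall_at_k(text_embs, image_embs, k=2))
```

This brute-force scan compares every caption against every image, so it only suits small samples. At the full 25 million candidates, an approximate nearest-neighbor index would likely be needed in place of the inner loop.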
Strengths
- 25,000,000 unique image-caption records
- Compatible with the data loading scripts and schema used for the 4M Img Caps dataset
- Optimized for large-scale pre-training of multimodal transformers and CLIP-style models
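For readers unfamiliar with CLIP-style pre-training, the objective is usually a symmetric contrastive loss over a batch of image-caption pairs: each image should score highest against its own caption, and each caption against its own image. The sketch below illustrates that general technique in plain Python. It is not the training code for this dataset. It assumes L2-normalized embeddings where row `i` of each list forms a pair, and the temperature of 0.07 is an assumed value chosen for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over image-to-text and text-to-image logits.

    Assumes embeddings are L2-normalized and that index i in both lists
    belongs to the same image-caption pair.
    """
    n = len(image_embs)
    # logits[i][j] = similarity of image i and caption j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text direction: the correct caption for image i is caption i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: transpose the logits and repeat.
    logits_t = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return (i2t + t2i) / 2


# Toy batch (assumed values): perfectly aligned pairs versus swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

print(clip_style_loss(images, aligned_texts))
print(clip_style_loss(images, swapped_texts))
```

The aligned batch yields a loss near zero and the swapped batch a much larger one, which is the signal a model trained on image-caption pairs would minimize.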