This dataset contains 5,000 test images from the MSCOCO 2014 collection, each paired with human-annotated captions, for image-text retrieval tasks. The data follows the Karpathy split, a standard benchmark for evaluating cross-modal alignment between visual features and natural language descriptions.
Use Cases
- Calculate Recall@K metrics for image-to-text retrieval by ranking the candidate captions for each image by embedding similarity and checking whether a ground-truth caption appears in the top K.
- Benchmark text-to-image retrieval systems using the 5,000 images as a search corpus.
- Validate the performance of image captioning models by comparing generated text to the ground-truth human annotations.
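As a rough illustration of the Recall@K use case, the sketch below computes the metric from a precomputed similarity matrix in plain Python. The function name `recall_at_k`, the toy scores, and the ground-truth layout (a set of correct candidate indices per query) are assumptions made for this example, not part of the dataset or of any particular library. A query counts as a hit at K if at least one of its correct candidates is ranked in the top K.

```python
def recall_at_k(scores, gt, ks=(1, 5, 10)):
    """Hypothetical helper: fraction of queries with a correct match in the top K.

    scores[i][j] is the similarity between query i and candidate j.
    gt[i] is the set of candidate indices that are correct for query i.
    """
    hits = {k: 0 for k in ks}
    for row, correct in zip(scores, gt):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        position = {j: r for r, j in enumerate(ranked)}
        # Best (lowest) rank reached by any correct candidate.
        best = min(position[j] for j in correct)
        for k in ks:
            if best < k:
                hits[k] += 1
    n = len(scores)
    return {k: hits[k] / n for k in ks}


# Toy data (made up): 2 images, 4 captions, captions 0-1 describe image 0
# and captions 2-3 describe image 1.
sim = [
    [0.9, 0.2, 0.1, 0.3],  # image 0 vs. captions 0..3
    [0.6, 0.1, 0.5, 0.4],  # image 1 vs. captions 0..3
]

# Image-to-text: each image is a query with several correct captions.
i2t = recall_at_k(sim, [{0, 1}, {2, 3}], ks=(1, 2))

# Text-to-image: transpose so each caption is a query with one correct image.
sim_t = [list(col) for col in zip(*sim)]
t2i = recall_at_k(sim_t, [{0}, {0}, {1}, {1}], ks=(1,))

print(i2t)  # {1: 0.5, 2: 1.0}
print(t2i)  # {1: 1.0}
```

On the full test set the same logic would apply to a 5,000-row image-to-text matrix, although a vectorized implementation would be preferable at that scale.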
Strengths
- 5,000 unique images sourced from the MSCOCO 2014 validation set.
- Includes multiple natural language captions per image, so image-to-text retrieval has several correct matches per query while each caption maps back to a single image.
- Formatted specifically for the Karpathy split as defined in the Stanford deepimagesent repository.