This dataset contains 12,000,000 English image-caption pairs derived from Google's Conceptual 12M dataset. It is stored in TSV format, with each entry holding an image URL, a local filename, and a descriptive caption.
Use Cases
- Train multimodal embedding models using the image link and caption columns.
- Fine-tune text-to-image synthesis models by pairing the caption text with downloaded images.
- Benchmark image retrieval systems using the provided English captions as search queries.
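
For the retrieval use case, one common way to score a system is recall@k: each caption is used as a query, and a hit is counted when its paired image appears among the k highest-scoring images. The following is a minimal sketch under that assumption. The `recall_at_k` helper and the toy score matrix are hypothetical and are not part of the dataset; real scores would come from whatever retrieval model is being benchmarked.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of caption queries whose paired image ranks in the top k.

    Assumption: scores[i][j] is the similarity between caption i and image j,
    and the correct image for caption i is image i.
    """
    hits = 0
    for i, row in enumerate(scores):
        # Indices of the k images with the highest similarity to caption i.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(scores)


# Toy similarity matrix for three caption/image pairs (made-up values).
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(toy_scores, 1)
r2 = recall_at_k(toy_scores, 2)
```

At the full 12,000,000 rows a dense score matrix is impractical, so a benchmark would likely evaluate on a sampled subset or use an approximate nearest-neighbour index instead.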
Strengths
- 12,000,000 rows of image-text associations for large-scale model training.
- TSV file structure containing image links, downloaded file names, and captions.
- Cleaned data specifically optimized for TPU-VM environments.
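
As a rough illustration of the TSV layout, the sketch below reads rows with Python's standard `csv` module. It assumes three tab-separated columns in the order image URL, local filename, caption, with no header row; the actual column order and header should be checked against the file. The `CaptionRow` type, the `read_caption_rows` helper, and the sample rows are hypothetical.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class CaptionRow(NamedTuple):
    image_url: str
    filename: str
    caption: str


def read_caption_rows(handle: TextIO) -> Iterator[CaptionRow]:
    """Yield one CaptionRow per well-formed line of a tab-separated file.

    Assumption: columns are (image URL, local filename, caption), no header.
    """
    # QUOTE_NONE keeps quotation marks inside captions as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        yield CaptionRow(*fields)


# In-memory stand-in for the real file; the rows are invented examples.
sample = io.StringIO(
    "https://example.com/a.jpg\ta.jpg\tA dog on a beach\n"
    "malformed line without tabs\n"
    "https://example.com/b.jpg\tb.jpg\tTwo cups of coffee\n"
)
rows = list(read_caption_rows(sample))
```

Reading through a generator keeps memory use flat, which matters at 12,000,000 rows; for the real file, pass a handle opened with `newline=""` and `encoding="utf-8"`.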