This dataset contains 10,000,000 image-caption pairs generated with the Florence-2 vision-language model for the Megalith-10M image collection. The captions supplement the previously uncaptioned, CC0-like images to support vision-language model training.
Use Cases
- Train text-to-image diffusion models using the Florence-2 captions as training prompts
- Fine-tune vision-language models for image captioning or visual question answering using the paired image and text data
- Build image search engines by mapping the Florence-2 text descriptions to the original Megalith-10M images
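As a rough illustration of the image search use case, the sketch below builds a small inverted index from caption words to image identifiers and answers keyword queries against it. The record layout (an image ID paired with a caption string), the sample captions, and the helper names `tokenize`, `build_index`, and `search` are all assumptions made for this example; they are not taken from the dataset's actual schema or contents.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each caption token to the set of image IDs whose caption contains it.

    `records` is an iterable of (image_id, caption) pairs. This layout is
    hypothetical, not the dataset's real schema.
    """
    index = defaultdict(set)
    for image_id, caption in records:
        for token in tokenize(caption):
            index[token].add(image_id)
    return index


def search(index, query):
    """Return the IDs of images whose caption contains every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up sample records standing in for real image-caption pairs.
records = [
    ("img_0001", "A brown dog running on a sandy beach"),
    ("img_0002", "A sailboat on a calm lake at sunset"),
    ("img_0003", "A dog sleeping on a wooden porch"),
]

index = build_index(records)
matches = sorted(search(index, "dog beach"))
print(matches)
```

Exact keyword matching is used here only because it is easy to follow. A search engine over 10,000,000 captions would more plausibly use text embeddings and approximate nearest-neighbor lookup, with the returned IDs resolved back to the original Megalith-10M images.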
Strengths
- 10,000,000 image-caption pairs
- Captions generated using the Florence-2 vision-language model
- Derived from the CC0-like Megalith-10M image repository