This dataset contains more than 10,000 hyper-detailed image descriptions with object-level annotations, derived from the Open Images dataset. It includes fine-grained attributes, spatial relationships, and dense scene narratives intended to improve vision-language model alignment.
Use Cases
- Fine-tune vision-language models for dense captioning using the detailed scene description field.
- Benchmark large vision-language models (LVLMs) on their ability to identify specific object attributes and spatial arrangements.
- Train text-to-image models to follow complex, multi-object prompts based on the provided ground-truth descriptions.
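
As a rough illustration of the first two use cases, the sketch below turns one record into a dense-captioning training pair and a set of attribute questions for benchmarking. The record layout is an assumption made for this example: the field names `image_id`, `description`, `objects`, `label`, `attributes`, and `bbox`, along with the sample values, are hypothetical and should be replaced with the dataset's actual schema.

```python
# Hypothetical record layout: every field name and value here is an assumption
# for illustration, not the dataset's documented format.
example_record = {
    "image_id": "img_0001",
    "description": "  A red bicycle leans against a brick wall beside a small brown dog.  ",
    "objects": [
        {"label": "bicycle", "attributes": ["red", "leaning"], "bbox": [0.10, 0.30, 0.55, 0.90]},
        {"label": "dog", "attributes": ["small", "brown"], "bbox": [0.60, 0.55, 0.85, 0.92]},
        {"label": "wall", "attributes": [], "bbox": [0.00, 0.00, 1.00, 0.80]},
    ],
}


def to_captioning_pair(record: dict) -> dict:
    """Build a prompt/target pair for dense-captioning fine-tuning."""
    return {
        "image": record["image_id"],
        "prompt": "Describe this image in detail.",
        "target": record["description"].strip(),
    }


def attribute_questions(record: dict) -> list:
    """Build one attribute question per object that has at least one attribute."""
    questions = []
    for obj in record["objects"]:
        if not obj["attributes"]:
            continue  # nothing to ask about objects without attribute labels
        questions.append({
            "image": record["image_id"],
            "question": f"What attributes describe the {obj['label']}?",
            "answer": sorted(obj["attributes"]),
        })
    return questions


pair = to_captioning_pair(example_record)
questions = attribute_questions(example_record)
print(pair["target"])
print(len(questions), "attribute questions")
```

Sorting the reference attributes keeps the answer order stable, which makes set-style comparison against model output simpler.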
Strengths
- Contains over 10,000 images with human-refined, dense descriptions.
- Includes object-level metadata such as bounding boxes and specific attribute labels for every entity mentioned.
- Built with a multi-stage annotation pipeline that combines machine-generated drafts with expert human editing for factual precision.
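
Object-level bounding boxes make it possible to derive simple spatial relations programmatically, for example to cross-check a description or to generate spatial benchmark questions. The following is a minimal sketch, assuming boxes are given as normalized `(x_min, y_min, x_max, y_max)` tuples. That layout, the helper name `horizontal_relation`, and the sample boxes are assumptions for illustration and may not match the dataset's actual format.

```python
def horizontal_relation(box_a, box_b):
    """Describe where box_a sits relative to box_b along the x-axis.

    Boxes are assumed to be (x_min, y_min, x_max, y_max) in normalized
    image coordinates; this layout is an assumption, not a documented format.
    """
    a_xmin, _, a_xmax, _ = box_a
    b_xmin, _, b_xmax, _ = box_b
    if a_xmax <= b_xmin:
        return "left of"
    if a_xmin >= b_xmax:
        return "right of"
    return "overlapping"


# Hypothetical boxes for three objects in one image.
bicycle = (0.10, 0.30, 0.55, 0.90)
dog = (0.60, 0.55, 0.85, 0.92)
wall = (0.00, 0.00, 1.00, 0.80)

print("bicycle is", horizontal_relation(bicycle, dog), "dog")
print("dog is", horizontal_relation(dog, wall), "wall")
```

This only compares horizontal extents; relations such as "behind" or "on top of" need depth cues or the textual description and cannot be recovered from 2D boxes alone.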