PixelProse contains 16,896,214 image-caption pairs featuring dense synthetic descriptions generated by Gemini 1.0 Pro Vision. Released in 2024 by researchers at the University of Maryland (tomg-group-umd), the collection provides detailed textual representations for images sourced from CommonPool and CC12M.
Images are distributed as tar archives or through external splits. For details on how the vision-language model was prompted, see the associated arXiv paper (2406.10328).
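As a rough illustration of working with tar-packaged image-caption pairs, the sketch below walks an archive and pairs each image with a caption file that shares its basename. The layout it assumes (e.g. `000001.jpg` alongside `000001.txt`), the function names `iter_pairs` and `make_demo_tar`, and the demo payloads are all hypothetical and are not taken from the PixelProse release, so check the actual archive structure before adapting it. It uses only the Python standard library.

```python
import io
import tarfile
from pathlib import PurePosixPath

# Assumed image extensions; the real archives may use others.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def iter_pairs(fileobj):
    """Yield (key, image_bytes, caption) for members that share a basename.

    Assumes a hypothetical layout in which each image has a sibling
    UTF-8 .txt file holding its caption.
    """
    pending = {}  # key -> partially filled record
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            key = str(path.with_suffix(""))
            suffix = path.suffix.lower()
            data = tar.extractfile(member).read()
            record = pending.setdefault(key, {})
            if suffix in IMAGE_EXTS:
                record["image"] = data
            elif suffix == ".txt":
                record["caption"] = data.decode("utf-8")
            # Emit the pair as soon as both halves have been seen.
            if "image" in record and "caption" in record:
                del pending[key]
                yield key, record["image"], record["caption"]


def make_demo_tar():
    """Build a tiny in-memory archive with placeholder (not real) content."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        members = [
            ("000001.jpg", b"\xff\xd8fake"),
            ("000001.txt", b"A dense caption."),
        ]
        for name, payload in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for key, image, caption in iter_pairs(make_demo_tar()):
        print(key, len(image), caption)
```

To read a real archive, pass an opened binary file instead of the demo buffer, for example `iter_pairs(open(path, "rb"))`, where `path` points to a downloaded tar file.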