12 million image-text pairs sourced from 350 manually curated subreddits covering diverse objects and scenes. Subreddit names serve as coarse labels that steer the dataset's composition without requiring manual per-instance annotation.
Use Cases
- Train vision-language models for image captioning using the image-text pairs
- Perform zero-shot classification by leveraging the coarse labels derived from the 350 subreddit names
- Analyze linguistic variations in image descriptions across different subreddit communities
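As a rough illustration of the last use case, the sketch below computes the mean caption length and vocabulary size for each subreddit. The record layout (the `subreddit` and `caption` fields) and the sample rows are assumptions made for this example, not the dataset's documented schema, and whitespace tokenization is only a stand-in for a real tokenizer.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative only.
records = [
    {"subreddit": "birds", "caption": "A red cardinal on a snowy branch"},
    {"subreddit": "birds", "caption": "Blue jay at the feeder"},
    {"subreddit": "food", "caption": "Homemade sourdough bread"},
]

def caption_stats(rows):
    """Return {subreddit: {"mean_len": float, "vocab_size": int}}."""
    lengths = defaultdict(list)  # caption lengths (in tokens) per subreddit
    vocab = defaultdict(set)     # distinct lowercase tokens per subreddit
    for row in rows:
        tokens = row["caption"].lower().split()  # naive whitespace tokenizer
        lengths[row["subreddit"]].append(len(tokens))
        vocab[row["subreddit"]].update(tokens)
    return {
        name: {
            "mean_len": sum(lens) / len(lens),
            "vocab_size": len(vocab[name]),
        }
        for name, lens in lengths.items()
    }

stats = caption_stats(records)
for name, s in sorted(stats.items()):
    print(f"{name}: mean length {s['mean_len']:.1f}, vocabulary {s['vocab_size']}")
```

On real data, the same grouping could feed richer comparisons (for example, most frequent terms per community), but the per-subreddit grouping step would stay the same.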
Strengths
- 12 million image-text pairs collected from Reddit
- Includes data from 350 manually curated subreddits providing topical diversity
- Uses subreddit names as coarse-grained labels for dataset steering
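One way a downstream user might exploit the coarse labels is to build a subset restricted to chosen subreddits, with a cap on how many pairs each one contributes. This is a minimal sketch under assumed conditions: the `subreddit` and `caption` field names, the sample rows, and the helper `steer_by_subreddit` are all hypothetical and are not part of any published tooling for the dataset.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative only.
records = [
    {"subreddit": "birds", "caption": "A red cardinal on a snowy branch"},
    {"subreddit": "birds", "caption": "Blue jay at the feeder"},
    {"subreddit": "birds", "caption": "Heron by the pond"},
    {"subreddit": "food", "caption": "Homemade sourdough bread"},
    {"subreddit": "cars", "caption": "Vintage roadster at a show"},
]

def steer_by_subreddit(rows, keep, max_per_label):
    """Keep rows whose subreddit is in `keep`, at most `max_per_label` each."""
    counts = defaultdict(int)  # rows accepted so far per subreddit
    selected = []
    for row in rows:
        label = row["subreddit"]
        if label in keep and counts[label] < max_per_label:
            counts[label] += 1
            selected.append(row)
    return selected

subset = steer_by_subreddit(records, keep={"birds", "food"}, max_per_label=2)
print([row["subreddit"] for row in subset])
```

Because the labels are per-subreddit rather than per-image, a filter like this controls topical balance only coarsely; individual images within a kept subreddit may still be off-topic.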