Conceptual Captions (CC3M) contains approximately 3.3 million images annotated with captions. The dataset was created by pixparse. The images and their raw descriptions were harvested from the web, with each caption taken from the image's Alt-text HTML attribute.
Use Cases
- Train an image captioning model on 3.3M web-harvested images using their associated Alt-text descriptions.
- Analyze the stylistic variety of Alt-text captions compared to curated caption datasets.
- Pre-train a vision-language model on a large-scale corpus of web images and their raw textual descriptions.
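As a starting point for the stylistic analysis mentioned above, the following minimal sketch computes a few simple caption statistics that could be compared across datasets. It assumes the captions have already been loaded into a plain list of strings. The function name `caption_stats` and the whitespace tokenization are illustrative assumptions, not part of the dataset.

```python
from collections import Counter
from statistics import mean


def caption_stats(captions):
    """Return simple length and vocabulary statistics for a list of captions.

    Assumption: whitespace tokenization on lowercased text is a good enough
    proxy for a rough comparison; a real study would likely use a proper tokenizer.
    """
    token_lists = [caption.lower().split() for caption in captions]
    lengths = [len(tokens) for tokens in token_lists]
    vocab = Counter(tok for tokens in token_lists for tok in tokens)
    total_tokens = sum(lengths)
    return {
        "num_captions": len(captions),
        "mean_length": mean(lengths) if lengths else 0.0,
        "vocab_size": len(vocab),
        # Distinct tokens divided by total tokens: higher means more varied wording.
        "type_token_ratio": len(vocab) / total_tokens if total_tokens else 0.0,
    }
```

Running the same function on a sample of Alt-text captions and on a sample from a curated caption set gives comparable numbers, provided both samples are of similar size (type-token ratio is sensitive to sample size).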
Strengths
- Approximately 3.3 million image-caption pairs provide substantial scale for model training.
- Captions harvested from web Alt-text represent a wider variety of descriptive styles than curated annotations.
- The dataset has a recent recorded update (December 2023).
Limitations
- Captions are raw web Alt-text, which may contain noise, inaccuracies, or non-descriptive text.
- Specific image sources, geographic distribution, and temporal range are not documented.
- Dataset structure details such as columns, file formats, and total size are not documented.
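Because the captions are raw Alt-text, a light filtering pass before training may help. The sketch below is one possible heuristic filter, not a method described by the dataset authors. The function names, the regular expressions, and all thresholds are hypothetical and would need tuning against real samples.

```python
import re

# Hypothetical patterns for Alt-text that is a file name or contains a URL.
_FILENAME_RE = re.compile(r"\.(jpe?g|png|gif|webp|bmp)$", re.IGNORECASE)
_URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)


def is_probably_descriptive(caption, min_words=3, max_words=50, min_letter_ratio=0.5):
    """Heuristically decide whether a caption looks like a real description.

    All thresholds are illustrative assumptions, not values from the dataset.
    """
    text = caption.strip()
    if not text:
        return False
    if _URL_RE.search(text) or _FILENAME_RE.search(text):
        return False
    num_words = len(text.split())
    if not (min_words <= num_words <= max_words):
        return False
    # Reject strings dominated by digits, punctuation, or other symbols.
    letters = sum(ch.isalpha() for ch in text)
    return letters / len(text) >= min_letter_ratio


def filter_captions(captions):
    """Keep only the captions that pass the heuristic check."""
    return [c for c in captions if is_probably_descriptive(c)]
```

For example, `filter_captions(["a red car parked on the street", "logo.png", "   "])` keeps only the first caption. Heuristics like these trade recall for precision, so it is worth inspecting a sample of the rejected captions before settling on thresholds.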
Provenance
- Source: Web images and their associated Alt-text HTML attributes.
- Collection Method: Harvested from the web.
- Time Range: Not specified.
- Freshness: 2023-12-15
- Geography: Not specified.