Conceptual Captions Web Images with Alt-Text Descriptions

Name: Conceptual Captions Web Images with Alt-Text Descriptions
Creator: pixparse
Published: 2023-12-14T18:06:04
Keywords: License: other, Task Categories: image-to-text, Size Categories: 1M<n<10M, Library: webdataset, Modality: text, Library: mlcroissant, Modality: image, WEBDATASET, Library: datasets, Region: us

Description

Conceptual Captions (CC3M) contains approximately 3.3 million images, each paired with a caption. This version of the dataset is published by pixparse. The images and their raw descriptions were harvested from the web, with each caption taken from the image's Alt-text HTML attribute.
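
The tags for this dataset list WebDataset as its distribution format. In that format a shard is an ordinary tar archive, and files that share a basename form one sample. The following is a minimal sketch of reading such a shard with only the Python standard library. The file extensions (`jpg`, `txt`), the sample keys, and the helper `make_demo_shard` are assumptions made for illustration; the actual field names in this dataset are not documented here.

```python
import io
import posixpath
import tarfile
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_samples(fileobj: BinaryIO) -> Iterator[Tuple[str, Dict[str, bytes]]]:
    """Yield (key, files) pairs from a WebDataset-style tar shard.

    Consecutive files whose names match up to the first dot of the
    basename are grouped into one sample, keyed by extension.
    """
    current_key = None
    files: Dict[str, bytes] = {}
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, base = posixpath.split(member.name)
            stem, _, ext = base.partition(".")
            key = posixpath.join(dirname, stem)
            # A new key means the previous sample is complete.
            if current_key is not None and key != current_key:
                yield current_key, files
                files = {}
            current_key = key
            files[ext] = tar.extractfile(member).read()
    if current_key is not None:
        yield current_key, files


def make_demo_shard() -> io.BytesIO:
    """Build a tiny in-memory shard with two fake samples (illustration only)."""
    buf = io.BytesIO()
    demo = [("000000", "a dog on a beach"), ("000001", "city skyline at night")]
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, caption in demo:
            for ext, payload in (("jpg", b"\xff\xd8fake"), ("txt", caption.encode("utf-8"))):
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for key, files in iter_samples(make_demo_shard()):
        print(key, sorted(files), files["txt"].decode("utf-8"))
```

For a real shard, an opened file (`open(path, "rb")`) can be passed in place of the in-memory buffer. For training at scale, the `webdataset` or `datasets` libraries listed in the tags are likely the more practical route.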

Use Cases

Train an image captioning model on 3.3M web-harvested images using their associated Alt-text descriptions.
Analyze the stylistic variety of Alt-text captions compared to curated caption datasets.
Pre-train a vision-language model on a large-scale corpus of web images and their raw textual descriptions.
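
For the stylistic-analysis use case, one simple starting point is to compare basic lexical statistics between Alt-text captions and a curated caption set. The sketch below assumes captions are available as plain Python strings; the function name `caption_stats`, the tokenization rule, and the chosen metrics are illustrative choices, not part of the dataset.

```python
import re
from collections import Counter
from statistics import mean
from typing import Dict, Iterable

# Lowercase word tokens made of letters, digits, and apostrophes (an assumed rule).
_WORD = re.compile(r"[a-z0-9']+")


def caption_stats(captions: Iterable[str]) -> Dict[str, float]:
    """Return caption count, mean length in words, vocabulary size,
    and type-token ratio for a collection of captions."""
    token_lists = [_WORD.findall(c.lower()) for c in captions]
    lengths = [len(tokens) for tokens in token_lists]
    vocab = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(lengths)
    return {
        "num_captions": len(token_lists),
        "mean_words": mean(lengths) if lengths else 0.0,
        "vocab_size": len(vocab),
        "type_token_ratio": len(vocab) / total if total else 0.0,
    }


if __name__ == "__main__":
    alt_text = ["a dog on a beach", "city skyline at night"]
    print(caption_stats(alt_text))
```

Running the same function over a sample of each corpus gives directly comparable numbers, although the type-token ratio is sensitive to sample size, so the samples should be of similar length.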

Strengths

Approximately 3.3 million image-caption pairs provide substantial scale for model training.
Captions harvested from web Alt-text represent a wider variety of descriptive styles than curated annotations.
The dataset was published on December 14, 2023 and last updated on December 15, 2023; no later updates are recorded.

Limitations

Captions are raw web Alt-text, which may contain noise, inaccuracies, or non-descriptive text.
Specific image sources, geographic distribution, and temporal range are not documented.
Dataset structure details such as column names, per-sample file layout, and total size on disk are not documented; the tags indicate only the WebDataset format and a size category of 1M to 10M samples.
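
Because raw Alt-text can be noisy or non-descriptive, a lightweight heuristic filter is a common first step before training. The following is a minimal sketch under stated assumptions: the function name `looks_descriptive`, the thresholds, and the specific rules (rejecting filenames, URLs, very short strings, and mostly non-alphabetic strings) are illustrative and are not the filtering used to build this dataset.

```python
import re

# Strings that are just an image filename, e.g. "IMG_0042.JPG" (assumed pattern).
_FILENAME = re.compile(r"^\S+\.(jpe?g|png|gif|webp|bmp)$", re.IGNORECASE)
# Strings that contain a URL (assumed pattern).
_URL = re.compile(r"https?://|www\.", re.IGNORECASE)


def looks_descriptive(caption: str, min_words: int = 3,
                      min_alpha_ratio: float = 0.6) -> bool:
    """Heuristically decide whether an Alt-text string reads like a description."""
    text = caption.strip()
    if not text:
        return False
    if _FILENAME.match(text) or _URL.search(text):
        return False
    if len(text.split()) < min_words:
        return False
    # Share of characters that are letters or whitespace.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


if __name__ == "__main__":
    examples = ["a dog running on a beach", "IMG_0042.JPG", "click here"]
    kept = [c for c in examples if looks_descriptive(c)]
    print(kept)
```

The thresholds trade recall for precision and would need tuning on a labeled sample; a filter like this catches obvious junk but cannot detect captions that are fluent yet inaccurate.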

Provenance

Source: Web images and their associated Alt-text HTML attributes.
Collection Method: Harvested from the web.
Time Range: not specified
Freshness: 2023-12-15
Geography: not specified

Quality Score

Overall: D38
Description: 39
Source: 44
Reputation: 34
Access: 22

Community

Downloads: 15.6K
Likes: 48
Views: 0

Dataset Info

Author: pixparse
Created: Dec 14, 2023
Updated: Dec 15, 2023
Last synced: Jul 26, 2026