Description

Conceptual 12M contains 12 million image-text pairs intended for vision-and-language pre-training. It was created by Google Research using a relaxed version of the data collection pipeline from Conceptual Captions 3M.

Use Cases

Train a vision-language model on 12 million image-text pairs for tasks like image captioning.
Train or fine-tune a contrastive image-text model on the pairs for cross-modal retrieval.
Use the image-text pairs for pre-training multimodal transformers to improve zero-shot classification.
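
The contrastive use case above can be illustrated with a minimal sketch of a symmetric image-text contrastive loss (InfoNCE style). This is a generic illustration, not a method documented for this dataset: the function name `symmetric_contrastive_loss`, the temperature value, and the assumption that embeddings arrive as plain lists of floats are all hypothetical. A real training setup would compute this on batched tensors in a deep learning framework.

```python
import math


def _log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def _normalize(vec):
    # L2-normalize; a zero vector is left unchanged.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    image_embs[i] and text_embs[i] are assumed to come from the same
    image-text pair; every other item in the batch acts as a negative.
    The temperature default is an illustrative assumption.
    """
    if not image_embs or len(image_embs) != len(text_embs):
        raise ValueError("need equal, non-empty batches of embeddings")

    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text: the matching text for image i sits in column i.
    i2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)
```

With this sketch, a batch whose image and text embeddings line up pair by pair yields a lower loss than the same batch with the captions shuffled, which is the signal a retrieval model is trained on.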

Strengths

Contains 12 million image-text pairs, providing a large-scale resource for pre-training.
Built by Google Research, with an associated paper (arXiv:2102.08981) listed in the dataset tags.
Based on a pipeline from the established Conceptual Captions 3M dataset.

Limitations

The column structure, image sources, and annotation quality are not documented here.
The dataset's geographic and temporal coverage is not specified, which may limit generalizability.
The 'relaxed' collection pipeline may introduce more noise compared to its predecessor.
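
Because the relaxed pipeline may admit noisier pairs, a light filtering pass before training is a common precaution. The following is a minimal sketch under stated assumptions: it assumes records are available as `(url, caption)` tuples, which this page does not confirm, and the function name `filter_pairs` and the word-count thresholds are hypothetical choices, not part of the dataset's own tooling.

```python
from urllib.parse import urlparse


def filter_pairs(pairs, min_words=3, max_words=256):
    """Drop obviously unusable (url, caption) records.

    Assumes each record is a (url, caption) tuple of strings; the
    thresholds are illustrative defaults, not recommended values.
    """
    seen = set()
    kept = []
    for url, caption in pairs:
        # Collapse runs of whitespace so duplicates compare equal.
        caption = " ".join(caption.split())
        n_words = len(caption.split())

        # Keep only well-formed http(s) URLs.
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue

        # Drop captions that are too short or too long to be useful.
        if not (min_words <= n_words <= max_words):
            continue

        # Drop exact duplicates, ignoring caption case.
        key = (url, caption.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((url, caption))
    return kept
```

Heuristics like these only remove structural noise. They do not check whether a caption actually describes its image, which would need a learned similarity model.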

Provenance

Source: google-research-datasets
Collection Method: Collected using a relaxed version of the pipeline from Conceptual Captions 3M.
Freshness: Last updated on 2024-01-18.
Geography: Region tag indicates 'us', but specific spatial coverage is unknown.

The license is tagged 'other' with no further detail given here; users should verify the terms of use before downloading. The dataset is monolingual (English).

Source Datasets: original
License: other
Task Categories: image-to-text
Language: en
Language Creators: found
Size Categories: 10M<n<100M
Task IDs: image-captioning
Annotations Creators: found
arXiv: 2102.08981
Region: us
Multilinguality: monolingual
