Description

The Public Multimodal Dataset (PMD) contains 70 million image-text pairs covering 68 million unique images. It was introduced in the FLAVA paper and aggregates publicly available sources: Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome, and a subset of YFCC100M.

Use Cases

Train multimodal models like FLAVA on 70 million image-text pairs for vision-language representation learning.
Fine-tune image captioning models using the diverse text descriptions associated with the 68 million unique images.
Analyze the distribution and characteristics of image-text pairs aggregated from sources like Conceptual Captions, WIT, and COCO.
Benchmark cross-modal retrieval performance using the paired image and text data.
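The distribution analysis mentioned above can be sketched in a few lines. This is a minimal illustration only: the record layout and the field name `source` are assumptions made for the example, not the dataset's documented schema, and the sample rows are invented.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def count_by_source(records: Iterable[Mapping[str, object]],
                    source_key: str = "source") -> Dict[str, int]:
    """Count image-text pairs per originating dataset.

    `source_key` is a hypothetical field name; records that lack it are
    counted under "unknown".
    """
    counts = Counter(str(rec.get(source_key, "unknown")) for rec in records)
    return dict(counts)


def source_shares(counts: Mapping[str, int]) -> Dict[str, float]:
    """Convert raw counts into fractions of the total (empty input gives {})."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Toy records standing in for real rows; the field names are assumptions.
sample = [
    {"source": "COCO", "text": "a dog on a beach"},
    {"source": "COCO", "text": "two people riding bikes"},
    {"source": "WIT", "text": "map of a river delta"},
    {"text": "caption with no source field"},
]
counts = count_by_source(sample)
shares = source_shares(counts)
```

Because `count_by_source` accepts any iterable, the same function could be fed a streamed reader over the real data once the actual field names are known.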

Strengths

Contains 70 million image-text pairs, providing a large-scale resource for multimodal learning.
Aggregates data from nine established public datasets (counting Conceptual Captions and Conceptual Captions 12M separately), offering diversity in content and annotation style.
Includes 68 million unique images across the 70 million pairs, so relatively few images are repeated.
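The relationship between the two headline figures can be made concrete with a short calculation. The sketch below uses the rounded counts quoted in the description, so the result is approximate; the function name is illustrative.

```python
def repeated_image_pairs(total_pairs: int, unique_images: int) -> int:
    """Pairs beyond the first one for each image (pairs minus unique images)."""
    if unique_images > total_pairs:
        raise ValueError("unique images cannot exceed total pairs")
    return total_pairs - unique_images


TOTAL_PAIRS = 70_000_000      # rounded figure from the description
UNIQUE_IMAGES = 68_000_000    # rounded figure from the description

extra = repeated_image_pairs(TOTAL_PAIRS, UNIQUE_IMAGES)
share = extra / TOTAL_PAIRS
print(f"{extra:,} pairs reuse an image ({share:.1%} of all pairs)")
# prints: 2,000,000 pairs reuse an image (2.9% of all pairs)
```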

Limitations

The dataset is an aggregation of existing datasets, which may inherit their respective biases, inconsistencies, and annotation noise.
The column structure is not documented here and no sample rows are shown, so initial exploration and parsing may take extra effort.
The last update was in August 2022, so newer images and concepts are not included.
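Since the column structure is undocumented, a first pass over the data might simply record which fields appear and what types they hold. The following is a minimal sketch under that assumption; the field names in the toy rows (`image_url`, `text`, `source`) are hypothetical and real rows may look entirely different.

```python
from itertools import islice
from typing import Dict, Iterable, Mapping, Set


def summarize_fields(records: Iterable[Mapping[str, object]],
                     limit: int = 100) -> Dict[str, Set[str]]:
    """Map each field name to the set of Python type names seen in the
    first `limit` records."""
    summary: Dict[str, Set[str]] = {}
    for rec in islice(records, limit):
        for key, value in rec.items():
            summary.setdefault(key, set()).add(type(value).__name__)
    return summary


# Toy rows; the field names are assumptions made for illustration.
rows = [
    {"image_url": "https://example.com/a.jpg", "text": "a cat", "source": "COCO"},
    {"image_url": None, "text": "a street", "source": "SBU Captions"},
]
schema = summarize_fields(rows)
for field, types in sorted(schema.items()):
    print(field, sorted(types))
```

Capping the scan with `limit` keeps the check cheap on a dataset of this size; a field that shows more than one type (for example a string and `None`) is a hint that missing values need handling during parsing.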

Provenance

Source: Aggregated from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome, and a subset of YFCC100M.
Collection Method: Aggregation of publicly available image-text pair datasets.
Freshness: Last updated on 2022-08-09.

Because the dataset is a large-scale aggregation, users should review the licenses and terms of the original constituent datasets (e.g., Conceptual Captions, COCO, YFCC100M) for specific usage restrictions. The data schema and file formats are not described here.

Tags: source datasets: original; task categories: image-to-text; task IDs: image-captioning; language: en; language creators: found; annotations creators: found; multilinguality: monolingual; size categories: 10M<n<100M; license: CC BY 4.0; region: us; arXiv: 2103.01913, 1504.00325, 2112.04482, 2111.11431

Public Multimodal Dataset of 70 Million Image-Text Pairs
