Name: YFCC15M Image and Page URL Subset for Vision-Language Training
Creator: vishaal27
Published: 2024-01-08T01:18:43
Keywords: Image Text Pairs, Vision Language Models, Multimodal Learning, Computer Vision, Multimodal

Description

A subset of approximately 15 million image-text pairs from the YFCC100M dataset, curated for training vision-language models. It was prepared by author vishaal27 and uploaded to Hugging Face in January 2024. The dataset provides page URLs and direct image download URLs for each entry.

Use Cases

Training contrastive image-text models like CLIP using the provided image-download-urls and associated page metadata.
Benchmarking dataset robustness and quality by analyzing the ~15M subset against the full YFCC100M collection.
Preprocessing image-text data for model training using the img2dataset tool with the provided csv file of page-urls and image-download-urls.
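The img2dataset workflow mentioned above can be sketched as a small helper that assembles the command line. This is a minimal sketch, not the author's documented procedure: the file name `yfcc15m.csv`, the output folder, and the assumption that the CSV column headers are literally `image-download-urls` and `page-urls` are all hypothetical and should be checked against the actual file.

```python
import json
import shlex


def build_img2dataset_command(csv_path, output_folder,
                              url_col="image-download-urls",
                              extra_cols=("page-urls",),
                              image_size=256):
    """Return the argv list for an img2dataset CLI call over a URL CSV.

    Column names are assumptions; confirm them against the CSV header.
    """
    cmd = [
        "img2dataset",
        "--url_list", csv_path,
        "--input_format", "csv",
        "--url_col", url_col,
        "--output_format", "webdataset",
        "--output_folder", output_folder,
        "--image_size", str(image_size),
    ]
    if extra_cols:
        # The flag takes a JSON-style list of column names to keep as metadata.
        cmd += ["--save_additional_columns", json.dumps(list(extra_cols))]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_img2dataset_command("yfcc15m.csv", "yfcc15m_shards")
print(shlex.join(cmd))
```

The printed string can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once img2dataset is installed. Keeping the page URL as an additional column preserves the link between each downloaded image and its source page.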

Strengths

Approximately 15 million data points, providing a substantial scale for model training.
Includes both page-urls and image-download-urls, so images can be fetched directly and traced back to their source pages.

Limitations

The exact number of rows, column definitions, and sample data are not documented.
Image quality, text relevance, and potential biases from the source YFCC100M dataset are not detailed.
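Because the row count and column definitions are undocumented, it may help to inspect the CSV before starting a large download. The following is a minimal sketch using only the standard library; the expected column names `page-urls` and `image-download-urls` are assumptions taken from the description, and the in-memory sample stands in for the real file.

```python
import csv
import io


def summarize_csv(handle, expected=("page-urls", "image-download-urls")):
    """Report the header, any expected columns that are absent, and the row count."""
    reader = csv.DictReader(handle)
    header = list(reader.fieldnames or [])
    missing = [name for name in expected if name not in header]
    rows = sum(1 for _ in reader)  # streams the file, so memory use stays small
    return {"columns": header, "missing": missing, "rows": rows}


# Tiny in-memory sample with made-up URLs, for illustration only.
sample = io.StringIO(
    "page-urls,image-download-urls\n"
    "http://example.com/page/1,http://example.com/img/1.jpg\n"
    "http://example.com/page/2,http://example.com/img/2.jpg\n"
)
print(summarize_csv(sample))
```

For the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `summarize_csv`. A non-empty `missing` list indicates that the column names assumed here do not match the actual header.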

Provenance

Source: Subset of the YFCC100M dataset.
Collection Method: Curated for the paper 'Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP'.
Freshness: Last updated on Hugging Face in January 2024.

Images are not included in the repository; users must download them using the provided URLs and a tool like img2dataset. The license for the underlying YFCC100M data and this specific subset is not stated.
