DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Machine Learning

CC12M Images 102: 90,000 Web-Sourced Images with Captions

Name: CC12M Images 102: 90,000 Web-Sourced Images with Captions
Creator: Neomi26
Published: 2026-06-02T04:04:47
Keywords: Image Captions, Image, Computer Vision, Tabular, Multimodal

Available on 1 platform

Description

90,000 web-sourced images re-hosted as individual JPEG files for browser access. The dataset includes a manifest with columns for image URLs, source URLs, captions, and dimensions. It is part of a larger series of approximately 131 repositories maintained by Neomi26 on Hugging Face.

Use Cases

Training image captioning models based on the provided caption text.
Fine-tuning vision-language models using the image-caption pairs.
Conducting web image analysis based on the source_url metadata.
Preprocessing image data for model training based on the provided width and height dimensions.
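The caption and dimension use cases above could be approached roughly as follows. This is a minimal sketch, not the dataset's documented loading procedure: it assumes the manifest is a CSV file, that the dimension columns are named width and height, and that an empty caption should be skipped. The sample rows, the load_pairs helper, and the min_side threshold are all hypothetical.

```python
import csv
import io

# Hypothetical manifest excerpt. The real file's format and exact column
# names (especially width/height) are assumptions and should be verified.
SAMPLE_MANIFEST = """key,image_url,source_url,caption,width,height
a1,https://example.com/img/a1.jpg,https://example.org/page1,A dog on a beach,640,480
a2,https://example.com/img/a2.jpg,https://example.org/page2,,800,600
a3,https://example.com/img/a3.jpg,https://example.net/page3,Small icon,32,32
"""


def load_pairs(manifest_text, min_side=64):
    """Return (image_url, caption) pairs for rows that have a caption and
    whose shorter side is at least min_side pixels."""
    pairs = []
    for row in csv.DictReader(io.StringIO(manifest_text)):
        caption = (row.get("caption") or "").strip()
        if not caption:
            continue  # no caption, so the row is unusable for captioning
        try:
            width, height = int(row["width"]), int(row["height"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric dimensions
        if min(width, height) >= min_side:
            pairs.append((row["image_url"], caption))
    return pairs


pairs = load_pairs(SAMPLE_MANIFEST)
print(pairs)
```

With the sample rows above, only the first row survives: the second has no caption and the third falls below the size threshold.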

Strengths

Contains 90,000 individual image files.
Includes a structured manifest file with columns for key, image_url, source_url, caption, and image dimensions (width and height).
Is part of a larger, organized series of repositories for easier management.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count for the manifest file is unknown, which may limit suitability assessment.
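Because the row count and field semantics are undocumented, a quick inspection pass after download can answer both questions. The sketch below assumes the manifest is a CSV file. The summarize_manifest helper and the sample text are hypothetical and only illustrate the idea.

```python
import csv
import io
from collections import Counter


def summarize_manifest(manifest_text):
    """Count rows and empty cells per column in CSV manifest text."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    missing = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        for column in reader.fieldnames or []:
            if not (row.get(column) or "").strip():
                missing[column] += 1
    return {
        "rows": row_count,
        "columns": list(reader.fieldnames or []),
        "missing": dict(missing),
    }


# Hypothetical two-row excerpt; the second row has no caption.
sample = (
    "key,image_url,caption\n"
    "a1,https://example.com/a1.jpg,A dog\n"
    "a2,https://example.com/a2.jpg,\n"
)
summary = summarize_manifest(sample)
print(summary)
```

Running the same function on the real manifest would show whether the row count matches the advertised 90,000 images and which columns have gaps.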

Provenance

Source: Subset of the CC12M dataset, re-hosted by Neomi26.
Collection Method: Likely gathered via web scraping, as indicated by the source_url column.
Freshness: Last updated 2026-06-02 05:23:54; freshness should be verified.
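One way to examine provenance is to tally the hosts that appear in the source_url column. This is a sketch under assumptions: the URLs below are invented for illustration, and count_source_domains is a hypothetical helper, not part of the dataset or any tooling around it.

```python
from collections import Counter
from urllib.parse import urlparse


def count_source_domains(source_urls):
    """Tally hostnames across an iterable of URL strings, skipping
    values that do not parse to a hostname."""
    counts = Counter()
    for url in source_urls:
        host = urlparse(url).hostname
        if host:
            counts[host] += 1
    return counts


# Invented example values standing in for the source_url column.
urls = [
    "https://example.org/page1",
    "https://example.org/page2",
    "http://Example.net/x",
    "not a url",
]
domain_counts = count_source_domains(urls)
print(domain_counts.most_common())
```

A heavily skewed distribution would suggest the subset leans on a few sites, which is worth knowing before checking the terms of use for those sources.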

License is unknown; users must verify terms of use for the original CC12M dataset.

Quality Score

C40

Description

Source

Reputation

Community

5 downloads

1 like

0 views

Dataset Info

Author: Neomi26
Created: Jun 2, 2026
Updated: Jun 2, 2026
Last synced: Jun 8, 2026
