Description
This dataset contains 90,000 web-sourced images, re-hosted as individual JPEG files for browser access. It includes a manifest with columns for image URLs, source URLs, captions, and dimensions. It is one of approximately 131 repositories in a series maintained by Neomi26 on Hugging Face.
Use Cases
Training image captioning models on the provided caption text.
Fine-tuning vision-language models on the image-caption pairs.
Analyzing where the images originate on the web, using the source_url metadata.
Filtering or resizing images before model training, using the provided width and height.
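The last two use cases can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the manifest is a CSV file with columns named key, image_url, source_url, caption, width, and height, which the dataset description does not confirm. The sample rows and the helper names (load_rows, filter_by_min_size, count_source_domains) are hypothetical.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse

# Hypothetical manifest rows. The real file format and column names
# are assumptions and should be checked after download.
SAMPLE_MANIFEST = """key,image_url,source_url,caption,width,height
0001,https://example.com/img/0001.jpg,https://site-a.example/page1,A dog on a beach,640,480
0002,https://example.com/img/0002.jpg,https://site-b.example/page2,A red bicycle,120,90
0003,https://example.com/img/0003.jpg,https://site-a.example/page3,Mountains at dusk,1024,768
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_by_min_size(rows, min_width, min_height):
    """Keep rows whose width and height both meet the minimums."""
    return [
        r for r in rows
        if int(r["width"]) >= min_width and int(r["height"]) >= min_height
    ]


def count_source_domains(rows):
    """Count how many rows come from each source_url host."""
    return Counter(urlparse(r["source_url"]).netloc for r in rows)


rows = load_rows(SAMPLE_MANIFEST)
large = filter_by_min_size(rows, 256, 256)
domains = count_source_domains(rows)
print(len(large), dict(domains))
```

For the real manifest, the same functions could be applied to the downloaded file's contents, once the actual column names have been confirmed.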
Strengths
Contains 90,000 individual image files.
Includes a structured manifest file with columns for key, image_url, source_url, caption, and image dimensions (width and height).
Belongs to an organized series of approximately 131 repositories, which keeps each repository at a manageable size.
Limitations
The published description is brief; data quality can only be assessed by inspecting the files after download.
Column-level documentation is absent; field semantics must be inferred after download.
The manifest's row count is not stated, which makes it harder to assess suitability before download.
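Because the row count and field semantics are undocumented, a quick first pass after download is to list the columns, count the rows, and count empty values per column. The sketch below assumes the manifest is a CSV file, which is not confirmed by the description. The sample data and the summarize_manifest helper are hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded manifest.
# Swap io.StringIO(SAMPLE) for open(path, newline="") on a real file.
SAMPLE = """key,image_url,caption,width,height
a1,https://example.com/a1.jpg,A cat,640,480
a2,https://example.com/a2.jpg,,800,600
a3,https://example.com/a3.jpg,A tree,,
"""


def summarize_manifest(fileobj):
    """Return column names, row count, and empty-value counts per column."""
    reader = csv.DictReader(fileobj)
    columns = reader.fieldnames or []
    row_count = 0
    empty_counts = {c: 0 for c in columns}
    for row in reader:
        row_count += 1
        for c in columns:
            # Treat missing or whitespace-only values as empty.
            if not (row.get(c) or "").strip():
                empty_counts[c] += 1
    return {"columns": columns, "rows": row_count, "empty": empty_counts}


summary = summarize_manifest(io.StringIO(SAMPLE))
print(summary)
```

Comparing the reported row count with the 90,000 image files would show whether every image has a manifest entry.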
Provenance
Source
Subset of the CC12M (Conceptual 12M) dataset, re-hosted by Neomi26.
Collection Method
Likely gathered via web scraping, as indicated by the source_url column.
Freshness
Last updated 2026-06-02 05:23:54; freshness should be verified.
License
Unknown; users must verify the terms of use of the original CC12M dataset.