Loading...
Loading...
Available on 1 platform
Sign in to view source links and access this dataset
12.4 million image-caption pairs constitute the largest public domain image-text dataset for training foundation models. The dataset was created by Spawning and released in October 2024, as indicated by the associated arXiv paper identifier. It features community-driven governance mechanisms aimed at reducing harm and supporting reproducibility.
License is identified as CDLA Permissive 2.0 by platform tags; users should verify terms. Data is stored in Parquet format and may require libraries like Polars, Dask, or Hugging Face Datasets for efficient loading.