PD3M is a subset of the PD12M dataset, containing 3.3 million image-caption pairs filtered for the highest aesthetic scores. PD12M is the largest public domain image-text dataset to date, designed for training foundation models while minimizing copyright concerns. The dataset was created by Spawning and introduces community-driven governance mechanisms via the Source.Plus platform.
Use Cases
- Training image-text foundation models on a large-scale public domain corpus.
- Fine-tuning vision-language models using high-quality aesthetic image-caption pairs.
- Benchmarking model performance on datasets with community-driven governance mechanisms.
- Studying the impact of aesthetic filtering on multimodal model training.
Strengths
- Contains 3.3 million image-caption pairs.
- Images are filtered for the highest aesthetic scores from the larger PD12M dataset.
- Derived from PD12M, the largest public domain image-text dataset to date.
- Dataset governance mechanisms via Source.Plus aim to reduce harm and support reproducibility.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
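- The exact row count is not recorded in the metadata; the description gives roughly 3.3 million pairs, which should be confirmed after download before assessing suitability.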
- Description metadata is limited; actual data quality requires manual inspection after download.
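Because column documentation is missing, a first pass over a small sample of rows can show which fields exist, what types they hold, and how often they are empty. The following is a minimal sketch that assumes the rows have already been loaded as a list of dictionaries; the field names in `sample` (`id`, `url`, `caption`, `width`) are hypothetical placeholders, not the dataset's confirmed schema.

```python
def summarize_columns(rows):
    """Summarize observed value types and missing counts per column.

    `rows` is an iterable of dicts. A value counts as missing when the key
    is absent from a row or its value is None.
    """
    seen_types = {}
    non_null = {}
    total = 0
    for row in rows:
        total += 1
        for key, value in row.items():
            seen_types.setdefault(key, set())
            non_null.setdefault(key, 0)
            if value is not None:
                seen_types[key].add(type(value).__name__)
                non_null[key] += 1
    return {
        key: {"types": sorted(seen_types[key]), "missing": total - non_null[key]}
        for key in seen_types
    }


# Hypothetical sample rows; the real field names must be read from the data.
sample = [
    {"id": 1, "url": "https://example.org/a.jpg", "caption": "A harbor at dusk", "width": 1024},
    {"id": 2, "url": "https://example.org/b.jpg", "caption": None, "width": 800},
    {"id": 3, "url": "https://example.org/c.jpg", "caption": "Botanical print"},
]

summary = summarize_columns(sample)
for name, info in summary.items():
    print(name, info)
```

Running the same summary on a few thousand real rows is usually enough to draft the missing column-level documentation by hand.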
Provenance
- Source: Spawning
- Collection Method: Subset of PD12M filtered for highest aesthetic scores.
- Freshness: Last updated 2024-11-19 20:29:12; freshness should be verified.
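The collection method above amounts to keeping the highest-scoring records from a larger pool. The sketch below illustrates that idea only; the actual scoring model, threshold, and selection procedure used to build PD3M are not described here. The `aesthetic_score` field name and the example values are assumptions made for illustration.

```python
import heapq


def top_by_score(records, keep, score_key="aesthetic_score"):
    """Return the `keep` records with the highest score, best first.

    `score_key` is a hypothetical field name; adjust it to the real schema.
    """
    return heapq.nlargest(keep, records, key=lambda record: record[score_key])


# Hypothetical records standing in for rows of the larger parent dataset.
pool = [
    {"id": "a", "aesthetic_score": 4.1},
    {"id": "b", "aesthetic_score": 6.7},
    {"id": "c", "aesthetic_score": 5.3},
    {"id": "d", "aesthetic_score": 7.2},
    {"id": "e", "aesthetic_score": 3.9},
]

subset = top_by_score(pool, keep=2)
subset_ids = [record["id"] for record in subset]
print(subset_ids)
```

A sketch like this can also be used to study aesthetic filtering: varying `keep` produces subsets of different strictness from the same pool for comparison.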