CleanPatrick is a large-scale benchmark for data cleaning built on the Fitzpatrick17k dermatology dataset. It contains dermatological images annotated with over 500,000 binary labels across three data quality issues. The dataset was created by Digital-Dermatology and was last updated in March 2026.
Use Cases
- Evaluating methods that detect off-topic samples in dermatology image collections, using the off-topic quality labels.
- Evaluating near-duplicate detection methods intended for dataset deduplication.
- Evaluating label-error detection in medical image datasets against the annotated binary quality labels.
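One way to use the binary quality labels in the use cases above is to score a detector by how many true issues appear among its top-ranked samples. The following is a minimal sketch under stated assumptions, not the benchmark's official evaluation protocol: the function name `precision_recall_at_k`, the label layout (1 = sample has the issue, 0 = it does not), and the toy data are all hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_at_k(
    labels: Sequence[int], scores: Sequence[float], k: int
) -> Tuple[float, float]:
    """Precision and recall of the k highest-scored samples.

    labels: assumed binary quality labels (1 = has the issue, 0 = clean).
    scores: detector output, where higher means "more likely an issue".
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    if not 0 < k <= len(labels):
        raise ValueError("k must be between 1 and the number of samples")

    # Rank sample indices from most to least suspicious.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in order[:k])
    total_positives = sum(labels)

    precision = hits / k
    recall = hits / total_positives if total_positives else 0.0
    return precision, recall


# Toy data only; these values do not come from CleanPatrick.
labels = [0, 1, 0, 1, 0, 0]
scores = [0.1, 0.9, 0.2, 0.4, 0.8, 0.3]
precision, recall = precision_recall_at_k(labels, scores, k=2)
```

The same function could be applied separately to each of the three issue types, provided the labels for each type can be separated after download.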
Strengths
- Over 500,000 binary labels for data quality issues give evaluations a large annotated base.
- Built on the established Fitzpatrick17k dermatology dataset, so the images come from a recognized medical imaging source.
- Specifically designed to measure three major data quality issues: off-topic samples, near-duplicates, and label errors.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count and file size are not stated, which makes it harder to assess suitability before download.
- Description metadata is limited; actual data quality requires manual inspection after download.
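Because field semantics have to be inferred after download, a quick per-column summary can be a useful first step. The sketch below assumes the annotations can be read as tabular rows (for example a CSV file); the actual file format is not documented here, and the helper `summarize_columns` and the column names `image_id`, `issue_type`, and `label` are hypothetical placeholders rather than the dataset's real schema.

```python
import csv
import io
from collections import Counter


def summarize_columns(rows, max_values=5):
    """Return, for each column, its most common values and their counts."""
    counters = {}
    for row in rows:
        for name, value in row.items():
            counters.setdefault(name, Counter())[value] += 1
    return {name: c.most_common(max_values) for name, c in counters.items()}


# Toy stand-in for a downloaded annotation file; the column names are
# hypothetical and are NOT taken from the real dataset.
toy_csv = io.StringIO(
    "image_id,issue_type,label\n"
    "a1,off_topic,0\n"
    "a2,off_topic,1\n"
    "a3,near_duplicate,0\n"
)
summary = summarize_columns(csv.DictReader(toy_csv))

for column, top_values in summary.items():
    print(column, top_values)
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle; columns with only two distinct values are likely candidates for the binary quality labels.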
Provenance
- Source
  - Built on the Fitzpatrick17k dermatology dataset.
- Freshness
  - Last updated 2026-03-23 11:18:57; freshness should be verified.