An evaluation benchmark derived from high-throughput screening (HTS) data, designed for classification and regression tasks. The dataset includes continuous inhibition activity values, expressed as percentages, with associated standard error and standard deviation. It was created by Alma Celeste Castaneda Leautaud and published on Harvard Dataverse in May 2026.
Use Cases
- Benchmark classification models on a task with enforced, non-trivial class separability.
- Train regression models to predict inhibition activity from the provided continuous activity values.
- Evaluate model robustness to noisy labels, given the described experimental variability and potential false positives and false negatives.
- Simulate realistic virtual screening workflows using the UMAP-sampled and clustered representation of the screening space.
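For the regression and noisy-label use cases, one possible way to account for measurement uncertainty is to weight each compound's error by the inverse of its squared standard error. The sketch below is an illustration under stated assumptions, not a method described by the dataset author: the function name `weighted_rmse`, the `floor` parameter, and the example values are all hypothetical.

```python
import math


def weighted_rmse(y_true, y_pred, std_err, floor=1e-6):
    """Inverse-variance weighted RMSE.

    Each squared residual is weighted by 1 / SE^2, so compounds with
    noisier measurements contribute less to the score. `floor` guards
    against division by zero when a standard error is reported as 0.
    """
    if not (len(y_true) == len(y_pred) == len(std_err)):
        raise ValueError("all inputs must have the same length")
    weights = [1.0 / max(se, floor) ** 2 for se in std_err]
    weighted_sq_err = sum(
        w * (t - p) ** 2 for w, t, p in zip(weights, y_true, y_pred)
    )
    return math.sqrt(weighted_sq_err / sum(weights))


# Made-up inhibition percentages, predictions, and standard errors.
measured = [10.0, 50.0]
predicted = [12.0, 40.0]
standard_errors = [1.0, 2.0]
print(weighted_rmse(measured, predicted, standard_errors))
```

Comparing this score with an unweighted RMSE gives a rough indication of how much a model's error is concentrated in the least reliable measurements.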
Strengths
- Designed to represent the virtual screening space realistically through UMAP-based sampling and clustering.
- Provides both classification targets and continuous inhibition activity values with standard error and standard deviation.
- Assay optimization resulted in a high overall quality score (Z' = 0.86).
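For context on the Z' value above: the Z'-factor is conventionally computed from positive and negative control wells as Z' = 1 - 3(σ_pos + σ_neg) / |μ_pos - μ_neg|, and values above 0.5 are generally regarded as indicating an excellent assay. The sketch below shows that standard formula; the control readouts are made up for illustration and are not taken from this dataset.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    mu_pos, mu_neg = mean(positive_controls), mean(negative_controls)
    sd_pos, sd_neg = stdev(positive_controls), stdev(negative_controls)
    separation = abs(mu_pos - mu_neg)
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    return 1.0 - 3.0 * (sd_pos + sd_neg) / separation


# Hypothetical control readouts (not from the dataset).
pos = [98.0, 100.0, 102.0]
neg = [-2.0, 0.0, 2.0]
print(round(z_prime(pos, neg), 2))  # 0.88
```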
Limitations
- Continuous activity labels are inherently noisy due to primary HTS conditions, fluorescence readout, and experimental variability.
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not reported, which makes suitability harder to assess before download.
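Because column semantics and row count have to be established after download, a first inspection pass might look like the sketch below. It assumes the data arrives as a delimited text file, which is not confirmed by the dataset description; the helper name `summarize_table` and the column names in the stand-in sample are hypothetical.

```python
import csv
import io


def summarize_table(handle, delimiter=","):
    """Return (column_names, row_count, first_row) for a delimited text stream."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    columns = list(reader.fieldnames or [])
    first_row = None
    row_count = 0
    for row in reader:
        if first_row is None:
            first_row = row  # keep one example row to help infer field meaning
        row_count += 1
    return columns, row_count, first_row


# Made-up stand-in for a downloaded file; real column names may differ.
sample = io.StringIO(
    "compound_id,inhibition_pct,std_err\n"
    "C1,12.5,1.1\n"
    "C2,87.0,2.3\n"
)
columns, n_rows, first = summarize_table(sample)
print(columns, n_rows)
```

With a real file, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`, passing a different `delimiter` if the file turns out to be tab-separated.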
Provenance
- Source: Harvard Dataverse
- Collection Method: Derived from high-throughput screening (HTS) with UMAP-based sampling and clustering.
- Freshness: Last updated 2026-05-06 03:21:15; freshness should be verified.