QSAR-Biodeg: Molecular Properties for Biodegradability Prediction
by Eddie Bergman
arff
Description
A subsampled dataset for quantitative structure-activity relationship (QSAR) modeling of chemical biodegradability. Eddie Bergman created it from the original qsar-biodeg dataset on OpenML using seeded random sampling. The subsampling used a seed of 1 and caps of 2000 rows, 100 columns, and 10 classes, with stratification applied. These values are upper limits, not the actual dimensions of the resulting data.
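The row-sampling step described above can be sketched as follows. This is a minimal illustration and not the procedure actually used to build the dataset: the function name `stratified_subsample`, the proportional-quota rule, and the toy labels are all assumptions introduced here.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, max_rows, seed):
    """Return sorted row indices of a class-stratified subsample.

    If the data already fits within max_rows, every index is kept.
    """
    n = len(labels)
    if n <= max_rows:
        return list(range(n))

    rng = random.Random(seed)  # a fixed seed makes the draw repeatable

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Give each class a share of max_rows proportional to its frequency.
    # Rounding down can leave a few slots unused, so the result may be
    # slightly smaller than max_rows.
    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        quota = len(members) * max_rows // n
        chosen.extend(rng.sample(members, quota))
    return sorted(chosen)


# Toy labels (hypothetical, not taken from the dataset): 3000 rows, two classes.
labels = ["class_a"] * 1000 + ["class_b"] * 2000
picked = stratified_subsample(labels, max_rows=2000, seed=1)
```

Calling the function again with the same labels and seed returns the same indices, which is the property a fixed seed is meant to provide.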
Use Cases
Predicting chemical biodegradability based on molecular property features.
Training classification models to assign chemicals to biodegradability classes (the subsample is capped at 10 classes; the actual number may be lower).
Benchmarking feature selection algorithms on a dataset capped at 100 columns of molecular descriptors.
Developing QSAR models for environmental risk assessment of new compounds.
Strengths
Subsampling used a fixed random seed (1), so the sample can be reproduced given the same procedure and source data.
The creation method used stratified sampling, which should keep the class distribution close to that of the source data.
The dataset is derived from a known QSAR benchmark (qsar-biodeg) on OpenML.
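Whether stratification actually preserved the class distribution can be checked once both label columns are in hand. The sketch below assumes the original and subsampled labels are available as plain lists; the helper names and the toy labels are hypothetical.

```python
from collections import Counter


def class_proportions(labels):
    """Map each class label to its fraction of the rows."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def max_proportion_gap(full_labels, sample_labels):
    """Largest absolute difference in class share between two label lists."""
    full = class_proportions(full_labels)
    sample = class_proportions(sample_labels)
    return max(abs(full[c] - sample.get(c, 0.0)) for c in full)


# Hypothetical labels for illustration only.
full_labels = [0] * 700 + [1] * 300
stratified_sample = [0] * 350 + [1] * 150  # same 70/30 split as the full set
skewed_sample = [0] * 400 + [1] * 100      # 80/20 split, not stratified

gap_stratified = max_proportion_gap(full_labels, stratified_sample)
gap_skewed = max_proportion_gap(full_labels, skewed_sample)
```

A gap near zero suggests the class shares were preserved; a larger gap, as in the skewed toy sample, suggests they were not.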
Limitations
The exact row count is not stated (only the 2000-row cap is known), which makes it hard to judge suitability for large-scale modeling before download.
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
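Since the row count and column names have to be recovered from the file itself, a first inspection of the downloaded ARFF file might look like the sketch below. It is a minimal, standard-library reader that assumes a dense ARFF layout and unquoted attribute names; the function name `summarize_arff` and the toy file contents are hypothetical and do not reflect the real dataset.

```python
def summarize_arff(text):
    """Return (attribute_names, row_count) for dense ARFF text.

    Minimal sketch: handles comment lines (%), @attribute declarations and
    a dense @data section. Sparse rows and quoted names containing spaces
    are not handled.
    """
    names = []
    rows = 0
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lowered = line.lower()
        if in_data:
            rows += 1  # every remaining non-comment line is one data row
        elif lowered.startswith("@attribute"):
            # "@attribute <name> <type>": the name is the second token.
            names.append(line.split(None, 2)[1])
        elif lowered.startswith("@data"):
            in_data = True
    return names, rows


# Toy ARFF content (hypothetical column names and values).
sample = """% toy file, not the real dataset
@relation toy
@attribute V1 numeric
@attribute V2 numeric
@attribute Class {1,2}
@data
3.9,2.6,1
4.1,1.0,2
"""
names, n_rows = summarize_arff(sample)
```

For a real file, the text would be read from disk first, for example with `open(path).read()`, where `path` is wherever the download was saved.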
Provenance
Source
OpenML dataset qsar-biodeg (ID 1494), subsampled by Eddie Bergman.
Collection Method
Algorithmic subsampling with random column and row selection, preserving class stratification.
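The random column selection mentioned above could look roughly like this. It is a sketch under assumptions, not the author's code: the function name `choose_columns` is hypothetical, and it assumes the cap applies to feature columns only, with the target column handled separately.

```python
import random


def choose_columns(feature_names, max_cols, seed):
    """Pick at most max_cols feature names at random, keeping original order."""
    if len(feature_names) <= max_cols:
        return list(feature_names)  # nothing to drop
    rng = random.Random(seed)
    keep = set(rng.sample(range(len(feature_names)), max_cols))
    return [name for i, name in enumerate(feature_names) if i in keep]


# Hypothetical feature names for illustration.
wide = [f"f{i}" for i in range(150)]
narrow = [f"f{i}" for i in range(40)]

wide_kept = choose_columns(wide, max_cols=100, seed=1)
narrow_kept = choose_columns(narrow, max_cols=100, seed=1)
```

When the source has no more columns than the cap, as with `narrow` here, the selection leaves the columns unchanged.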
License
Listed as us-pd (public domain in the United States).