QSAR-Biodeg: Molecular Properties for Biodegradability Prediction
by Eddie Bergman
arff
Description
A subsampled dataset for quantitative structure-activity relationship (QSAR) modeling of chemical biodegradability. Eddie Bergman derived it from the original qsar-biodeg dataset on OpenML using a controlled random sampling process. The subsampling used a seed of 4 and caps of 2000 rows, 100 columns, and 10 classes, with stratification applied.
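The subsampling described above can be illustrated with a minimal sketch. This is not the author's actual procedure, which the metadata does not spell out. The function names `stratified_subsample` and `cap_columns`, the per-class rounding rule, and the toy labels are all assumptions made for illustration. Only the seed (4) and the caps (2000 rows, 100 columns) come from the description.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, max_rows, seed):
    """Return sorted row indices, at most max_rows, sampled per class.

    Hypothetical helper: each class contributes a share of the budget
    proportional to its size (rounded down), so the total never exceeds
    max_rows.
    """
    n = len(labels)
    if n <= max_rows:
        return list(range(n))  # nothing to cut
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):  # sorted for a deterministic order
        members = by_class[label]
        quota = len(members) * max_rows // n  # integer floor of the share
        chosen.extend(rng.sample(members, quota))
    return sorted(chosen)


def cap_columns(names, max_cols, seed):
    """Return at most max_cols column names, chosen at random (hypothetical)."""
    if len(names) <= max_cols:
        return list(names)
    rng = random.Random(seed)
    keep = set(rng.sample(range(len(names)), max_cols))
    return [name for i, name in enumerate(names) if i in keep]


# Toy labels, not the real data: 3000 rows in two classes.
toy_labels = ["A"] * 1000 + ["B"] * 2000
picked = stratified_subsample(toy_labels, max_rows=2000, seed=4)
print(len(picked), "rows kept")
```

When the parent dataset is already within the caps, both helpers return their input unchanged, so in this sketch the limits act as upper bounds rather than targets.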
Use Cases
Training classification models to predict biodegradability class based on molecular descriptors.
Benchmarking feature selection algorithms on a controlled subset of chemical properties.
Developing QSAR models for regulatory assessment of chemical environmental persistence.
Studying the impact of dataset stratification on model performance for imbalanced chemical data.
Strengths
Subsampling was performed with a fixed random seed (4), ensuring reproducibility.
The creation process controlled for maximum rows (2000), columns (100), and classes (10).
Stratification was applied during row sampling, which should keep the class proportions close to those of the parent dataset.
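Whether stratification actually preserved the class distribution can be checked once both label columns are available. The sketch below compares class proportions between a parent and a subset. The helper names and the toy labels are hypothetical and do not come from the dataset.

```python
from collections import Counter


def class_proportions(labels):
    """Map each class label to its share of the rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def max_proportion_shift(parent_labels, subset_labels):
    """Largest absolute change in any class share between parent and subset."""
    parent = class_proportions(parent_labels)
    subset = class_proportions(subset_labels)
    classes = set(parent) | set(subset)
    return max(abs(parent.get(c, 0.0) - subset.get(c, 0.0)) for c in classes)


# Toy labels for illustration only.
parent = [1] * 300 + [2] * 700
subset = [1] * 30 + [2] * 70
print(max_proportion_shift(parent, subset))
```

A shift near zero suggests the subset mirrors the parent. A large shift would indicate that the stratification did not behave as expected for this data.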
Limitations
The exact row and column counts, and the column names, are not stated in the metadata.
Description metadata is limited; actual data quality requires manual inspection after download.
The last update date is unknown, so freshness is unverified.
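Because the metadata does not state the shape or column names, a first inspection step after download is to read them from the ARFF file itself. The sketch below is a deliberately minimal header reader using only the standard library. It assumes a dense ARFF file with one record per line and is not a full parser. The function name `summarize_arff` and the sample content are hypothetical.

```python
def summarize_arff(text):
    """Return (attribute_names, n_data_rows) for dense ARFF text.

    Minimal sketch: skips blank lines and '%' comments, reads names from
    '@attribute' lines, and counts every line after '@data' as one row.
    """
    names = []
    n_rows = 0
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue
        lowered = line.lower()
        if in_data:
            n_rows += 1
        elif lowered.startswith("@attribute"):
            rest = line[len("@attribute"):].strip()
            if rest[0] in "'\"":
                # Quoted name: take everything up to the matching quote.
                name = rest[1:rest.index(rest[0], 1)]
            else:
                name = rest.split()[0]
            names.append(name)
        elif lowered.startswith("@data"):
            in_data = True
    return names, n_rows


# Made-up sample, not taken from the dataset.
sample = """% toy file
@relation toy
@attribute V1 numeric
@attribute 'V 2' numeric
@attribute Class {1,2}
@data
3.9,2.6,1
4.1,2.2,2
"""
columns, rows = summarize_arff(sample)
print(len(columns), "columns,", rows, "rows")
```

For real work, an established ARFF reader is a better choice than this sketch, since it also handles sparse records, escaped characters, and type conversion.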
Provenance
Source
OpenML dataset qsar-biodeg (ID 1494), subsampled by Eddie Bergman.
Collection Method
Algorithmic subsampling from a parent dataset using random selection of rows, columns, and classes.
License
The dataset is released under a US public domain (us-pd) license.