A collection of datasets for training and evaluating machine learning models on small-molecule natural products. The data, totaling 128.0 MB, was compiled by Zhenming Liu from multiple public databases including COCONUT, NPASS, LOTUS, and MIBiG. The collection was last updated on 2026-04-30.
Use Cases
- Pretraining a small-molecule foundation model on SMILES strings from the COCONUT database.
- Classifying natural products into taxonomic categories using data prepared for the Natural Product Taxonomy Classification experiment.
- Predicting bioactivity of natural products using regression data sourced from the NPASS database.
- Predicting biological sources of compounds using data from the LOTUS database.
- Mining biosynthetic gene clusters using data constructed from the MIBiG and Pfam databases.
Strengths
- Data is aggregated from multiple authoritative public databases including COCONUT, NPASS, LOTUS, and MIBiG.
- The collection is licensed under CC-BY-4.0, facilitating open reuse and redistribution.
- Files are provided in common formats (PKL, CSV) suitable for machine learning workflows.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row counts are not reported, so it is hard to judge before download whether the data is large enough for a given model scale.
- Description metadata is limited; actual data quality requires manual inspection after download.
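Because column semantics and row counts are undocumented, a first pass over each CSV file after download can recover the basics. The following is a minimal sketch using only the Python standard library. The column names `smiles` and `activity` in the demo are hypothetical placeholders, not the collection's actual schema, and the type guesses are a heuristic rather than a statement about the real fields.

```python
import csv
import io
from collections import Counter


def guess_type(value):
    """Heuristically label one CSV cell as 'empty', 'int', 'float', or 'str'."""
    if value is None or value.strip() == "":
        return "empty"
    for label, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return label
        except ValueError:
            continue
    return "str"


def summarize_csv(handle, sample_rows=1000):
    """Report column names, total row count, and per-column type guesses.

    Types are guessed from the first `sample_rows` rows only; every row is
    still counted so the reported row count covers the whole file.
    """
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    seen = {name: Counter() for name in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        if row_count > sample_rows:
            continue  # keep counting rows, stop sampling types
        for name in columns:
            seen[name][guess_type(row.get(name))] += 1

    types, missing = {}, {}
    for name, counts in seen.items():
        missing[name] = counts.pop("empty", 0)
        # Most frequent non-empty label, or 'empty' if no values were seen.
        types[name] = counts.most_common(1)[0][0] if counts else "empty"
    return {
        "columns": columns,
        "row_count": row_count,
        "types": types,
        "missing_in_sample": missing,
    }


# Demo on in-memory data with hypothetical column names.
demo = io.StringIO("smiles,activity\nCCO,1.5\nc1ccccc1,\n")
print(summarize_csv(demo))
```

For a downloaded file, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`, where `path` points to one of the CSV files. The PKL files need separate care: unpickling executes code embedded in the file, so load them only if the source is trusted.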
Provenance
- Source
- Multiple public databases: COCONUT, NPASS, LOTUS, MIBiG, Pfam, and a Zenodo archive.
- Collection Method
- Data was obtained, preprocessed, and curated from the listed source databases.
- Time Range
- Not specified
- Freshness
- Last updated 2026-04-30 07:02:32 (time zone not stated); check the source for newer releases before use.
- Geography
- Not specified