Name: Small-Molecule Natural Product Data for Foundation Model Pretraining and Bioactivity Tasks
Creator: Zhenming Liu
Published: 2026-04-30T07:02:32
License: CC-BY-4.0
Keywords: Foundation Model, Bioactivity, CSV, Natural products, Tabular, Small Molecules, Cheminformatics

Description

A collection of datasets for training and evaluating machine learning models on small-molecule natural products. The data, totaling 128.0 MB, was compiled by Zhenming Liu from multiple public databases including COCONUT, NPASS, LOTUS, and MIBiG. The collection was last updated on 2026-04-30.

Use Cases

Pretraining a foundation model for small molecules based on SMILES strings from the COCONUT database.
Classifying natural products into taxonomic categories using data prepared for the Natural Product Taxonomy Classification experiment.
Predicting bioactivity of natural products using regression data sourced from the NPASS database.
Predicting biological sources of compounds using data from the LOTUS database.
Mining biosynthetic gene clusters using data constructed from the MIBiG and Pfam databases.
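
For the SMILES-based pretraining use case, the strings usually have to be split into tokens before they reach a model. The sketch below shows one common regex-based approach. It is not confirmed to be the tokenization used to prepare this collection; the function name `tokenize_smiles` and the token pattern are illustrative assumptions.

```python
import re

# Assumed token pattern (not taken from the dataset): bracket atoms first,
# then two-letter halogens, then single-letter atoms, ring-closure numbers,
# and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%[0-9]{2}|[0-9]|[()=#\-+\\/:~@?>*$.]"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: silently skipped characters would corrupt training data.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

The round-trip check is a deliberate choice: a string that the pattern cannot fully cover is rejected rather than partially tokenized, which makes unusual entries visible during preprocessing.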

Strengths

Data is aggregated from multiple authoritative public databases including COCONUT, NPASS, LOTUS, and MIBiG.
The collection is licensed under CC-BY-4.0, which permits reuse and redistribution with attribution.
Files are provided in common formats (PKL, CSV) suitable for machine learning workflows.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row counts are not reported, so it is hard to judge in advance whether the data suits a given model scale.
Descriptive metadata is limited; data quality must be checked by manual inspection after download.
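
Because column documentation and row counts are missing, a first pass over each downloaded CSV can recover the basics. The following is a minimal sketch using only the Python standard library; the function name `summarize_csv` and the file name in the usage note are hypothetical, and the type guess is a rough heuristic, not a schema.

```python
import csv
from collections import Counter


def _classify(value):
    """Rough guess at the kind of a single cell: empty, numeric, or text."""
    if value == "":
        return "empty"
    try:
        float(value)
        return "numeric"
    except ValueError:
        return "text"


def summarize_csv(lines):
    """Count rows and tally cell kinds per column.

    `lines` is an open text file handle or any iterable of CSV lines
    whose first line is the header.
    """
    reader = csv.DictReader(lines)
    columns = list(reader.fieldnames or [])
    kinds = {name: Counter() for name in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in columns:
            # Missing trailing fields come back as None; treat them as empty.
            value = (row.get(name) or "").strip()
            kinds[name][_classify(value)] += 1
    return {"rows": row_count, "columns": {n: dict(kinds[n]) for n in columns}}
```

A possible usage, assuming a downloaded file named `example.csv` (a placeholder, not an actual file name from the collection): open it with `open("example.csv", newline="", encoding="utf-8")` and pass the handle to `summarize_csv`. The PKL files need separate handling and should only be unpickled if the source is trusted, since unpickling can execute arbitrary code.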

Provenance

Source: Multiple public databases: COCONUT, NPASS, LOTUS, MIBiG, Pfam, and a Zenodo archive.
Collection Method: Data was obtained, preprocessed, and curated from the listed source databases.
Time Range: Not specified
Freshness: Last updated 2026-04-30 07:02:32; freshness should be verified.
Geography: Not specified
