Description
This dataset contains 1,587,709 chemical reaction samples, with a median of 7 molecules per sample. It was created by IDEA-AI4S and was last updated on Hugging Face in October 2024. It is derived from USPTO patent data and likely contains interleaved representations of reaction sequences.
Use Cases
Training machine learning models for chemical reaction prediction based on interleaved sequence data.
Developing generative models for novel molecule synthesis based on reaction patterns.
Benchmarking AI systems on tasks like retrosynthesis planning using patent-derived reaction data.
Analyzing common molecular motifs and transformations present in patented chemical processes.
Strengths
Contains 1,587,709 samples, providing a substantial scale for model training.
Offers a median of 7 molecules per sample, which suggests that each record describes a multi-component reaction context rather than a single molecule.
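The molecules-per-sample figure can be checked after download. The following is a minimal sketch, assuming reactions are stored as reaction SMILES strings of the form reactants>agents>products with molecules separated by dots. That storage format is an assumption and is not confirmed by this listing, and the sample strings and the function name count_molecules are hypothetical.

```python
from statistics import median


def count_molecules(reaction_smiles: str) -> int:
    """Count dot-separated entries across the reactant, agent, and product parts.

    Assumes the reaction SMILES convention 'reactants>agents>products'.
    Note that salts written as separate ions (e.g. [Na+].[Cl-]) are counted
    as two entries, so this is an upper bound on distinct compounds.
    """
    parts = reaction_smiles.split(">")
    return sum(len([mol for mol in part.split(".") if mol]) for part in parts)


# Hypothetical samples for illustration; these are not taken from the dataset.
samples = [
    "CCO>>CC=O",
    "CCO.CC(=O)O>>CC(=O)OCC.O",
    "CC(=O)Cl.OCC>ClCCl.CCN(CC)CC>CC(=O)OCC",
]

counts = [count_molecules(s) for s in samples]
median_count = median(counts)
print(counts, median_count)
```

If the downloaded records use a different representation, only count_molecules needs to change; the median computation stays the same.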
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
The reported count of 1,587,709 samples should be confirmed against the downloaded files before assessing suitability.
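Because column documentation is absent, a first pass over the downloaded records usually means listing the field names, the value types seen in each field, and the row count. The following is a minimal sketch, assuming the records have already been loaded as a list of dictionaries. The field names and values shown are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import Counter, defaultdict


def summarize_fields(records):
    """Map each field name to the Python type names observed and their counts."""
    summary = defaultdict(Counter)
    for record in records:
        for name, value in record.items():
            summary[name][type(value).__name__] += 1
    return {name: dict(types) for name, types in summary.items()}


# Hypothetical records for illustration; the real field names are unknown.
records = [
    {"id": 0, "reaction": "CCO>>CC=O", "yield": None},
    {"id": 1, "reaction": "CC(=O)O.CCO>>CC(=O)OCC.O", "yield": 0.82},
]

row_count = len(records)  # compare against the reported sample count
summary = summarize_fields(records)

print(row_count)
for field, types in summary.items():
    print(field, types)
```

Fields that show more than one type, such as a mix of NoneType and float, are worth checking by hand, since they often indicate missing values.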
Provenance
Source
USPTO (United States Patent and Trademark Office) patent data.
Collection Method
Processed and interleaved by IDEA-AI4S.
Freshness
Last updated 2024-10-22 02:34:10 (time zone not stated); check the Hugging Face page for later revisions.
A citation to the associated paper is required for any publications or projects using this dataset.