Description
The dataset contains 212,440 examples of molecular structures paired with NMR spectral information, split into training, validation, and test sets. It includes SMILES strings, molecular formulas, atom counts, and tokenized NMR data for both proton and carbon NMR. It was created by SpectrumWorld and last updated on Hugging Face in June 2026.
Use Cases
Train machine learning models to predict NMR spectra from molecular structure representations such as SMILES strings.
Validate computational chemistry models for spectral assignment using the provided proton and carbon NMR data.
Develop multi-task learning models that jointly predict molecular formula and spectral properties from structural inputs.
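For the multi-task use case, a molecular formula has to be turned into something a model can predict. The following is a minimal sketch of one possible encoding, a fixed-length vector of element counts. The `ELEMENTS` vocabulary and both helper functions are hypothetical and not part of the dataset; the parser assumes flat formulas without parentheses, charges, or isotope labels, which may not hold for every record.

```python
import re
from collections import Counter

# Hypothetical element vocabulary, chosen for illustration only; the
# dataset's actual element coverage is not documented.
ELEMENTS = ["C", "H", "N", "O", "F", "S", "Cl", "Br"]


def formula_to_counts(formula):
    """Parse a flat formula such as 'C2H6O' into a Counter of element counts."""
    # Reject anything that is not a plain run of element symbols and digits.
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = Counter()
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A symbol with no trailing digits counts once.
        counts[symbol] += int(digits) if digits else 1
    return counts


def counts_to_vector(counts, elements=ELEMENTS):
    """Map element counts onto a fixed-order list usable as a regression target."""
    unknown = set(counts) - set(elements)
    if unknown:
        raise ValueError(f"elements outside vocabulary: {sorted(unknown)}")
    return [counts.get(element, 0) for element in elements]


ethanol = formula_to_counts("C2H6O")
print(counts_to_vector(ethanol))
```

The same parsed counts could also be compared against the dataset's own atom-count field as a consistency check, once that field's meaning has been confirmed.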
Strengths
Contains 212,440 total examples, providing a substantial corpus for model training.
Explicitly split into 169,863 training, 21,279 validation, and 21,298 test examples, facilitating machine learning workflows.
Includes multiple molecular representations such as SMILES strings, formulas, and atom counts.
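The split sizes listed above can be checked with a few lines of arithmetic. This sketch uses only the counts stated on this page; the variable and function names are illustrative.

```python
# Split sizes as stated in this listing.
SPLITS = {"train": 169_863, "validation": 21_279, "test": 21_298}
STATED_TOTAL = 212_440


def split_fractions(splits):
    """Return each split's share of the combined example count."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


# The three splits should add up to the stated total.
assert sum(SPLITS.values()) == STATED_TOTAL

for name, fraction in split_fractions(SPLITS).items():
    print(f"{name:>10}: {fraction:.2%}")
```

The counts sum to the stated total and correspond to roughly an 80/10/10 split.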
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
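Because field semantics have to be inferred, a first pass over the downloaded rows might simply profile each column. The sketch below assumes the rows are available as a list of dictionaries, however they were loaded; the field names and values in `sample_rows` are made-up placeholders, not the dataset's real schema or data.

```python
from collections import defaultdict


def profile_columns(records, max_examples=2):
    """Summarise each field: observed value types, missing count, sample values."""
    summary = defaultdict(lambda: {"types": set(), "missing": 0, "examples": []})
    # Collect every key seen in any record so absent fields count as missing.
    keys = set().union(*(record.keys() for record in records))
    for record in records:
        for key in keys:
            value = record.get(key)
            info = summary[key]
            if value is None:
                info["missing"] += 1
                continue
            info["types"].add(type(value).__name__)
            if len(info["examples"]) < max_examples:
                info["examples"].append(value)
    return dict(summary)


# Placeholder rows with hypothetical field names, for illustration only.
sample_rows = [
    {"smiles": "CCO", "molecular_formula": "C2H6O", "num_atoms": 9},
    {"smiles": "C", "molecular_formula": "CH4", "num_atoms": None},
]

for field, info in sorted(profile_columns(sample_rows).items()):
    print(field, sorted(info["types"]), "missing:", info["missing"], info["examples"])
```

A summary like this shows which fields are strings, numbers, or token lists and how often they are empty, which is a reasonable starting point before deciding what each column means.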
Provenance
Source
SpectrumWorld on Hugging Face.
Collection Method
Likely compiled from computational chemistry simulations or public spectral databases, but the exact gathering method is not specified.
Freshness
Last updated 2026-06-05 07:06:53 (time zone not stated); check the Hugging Face repository for more recent changes before relying on this date.
License
No license is specified. Without one, commercial use and redistribution may not be permitted.