Name: Patent Markush Structure Images And CXSMILES Representations
Creator: docling-project
Published: 2026-03-19T15:21:59
Keywords: Image, Image To Smiles, Chemical Patents, Molecular Representation, Benchmark, Optical Character Recognition, Markush Structures, Multimodal

Description

This dataset comprises four subsets of chemical Markush structure images taken from patents, each image paired with its CXSMILES string representation. The largest subset, 'uspto-mol-m-54k-new', is a training set of 54,785 samples; the other three subsets are smaller evaluation benchmarks. The dataset was created by docling-project and was last updated in March 2026.
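
For readers unfamiliar with the target format: a CXSMILES string is a SMILES string optionally followed by a space and an extension block enclosed in '|' characters, which is where features such as R-group atom labels are recorded. The minimal sketch below separates the two parts. The function name split_cxsmiles and the example string are illustrative assumptions, not taken from the dataset, and the split is purely textual: it does not validate the chemistry.

```python
def split_cxsmiles(cxsmiles: str) -> tuple[str, str]:
    """Split a CXSMILES string into (core SMILES, extension block).

    Hypothetical helper for illustration only. The extension block is
    returned without its enclosing '|' characters, and is an empty
    string when the input is a plain SMILES.
    """
    text = cxsmiles.strip()
    # An extension block, when present, follows the SMILES after a space
    # and is wrapped in '|' characters.
    if text.endswith("|") and " |" in text:
        core, _, ext = text.partition(" |")
        return core.strip(), ext[:-1]
    return text, ""


# Illustrative string (not from the dataset): a benzene ring with one
# variable attachment point labelled R1 on the first atom.
core, ext = split_cxsmiles("*c1ccccc1 |$_R1;;;;;;$|")
print(core)  # the SMILES part
print(ext)   # the extension part
```

Keeping the two parts separate can be useful when a model gets the molecular backbone right but the Markush annotations wrong, or the reverse.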

Use Cases

Train image-to-text models to convert Markush structure images into CXSMILES strings using the 'uspto-mol-m-54k-new' training set.
Evaluate model performance on the 'uspto-markush' benchmark, which contains 74 samples with ground-truth OCR annotations.
Benchmark conversion accuracy on the 'm2s' subset of 103 Mol2Smiles (M2S) samples.
Test model generalization on the 'IP5-markush' collection of 878 Markush structures from international patent offices.
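
The evaluation use cases above can be sketched with a simple scoring function. This is a minimal sketch, assuming predictions and references are available as parallel lists of CXSMILES strings. The source does not state an official metric, so whitespace-normalized exact string match is used here as an assumption. It is a strict measure: two different strings can encode the same structure, so a chemistry-aware comparison would generally score higher. The function names are hypothetical.

```python
def normalize(cxsmiles: str) -> str:
    """Collapse runs of whitespace so formatting differences are ignored."""
    return " ".join(cxsmiles.split())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after normalization.

    Hypothetical helper for illustration; not the dataset's official metric.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# Illustrative strings, not taken from the dataset.
preds = ["CCO", "*c1ccccc1  |$_R1;;;;;;$|", "CCN"]
refs = ["CCO", "*c1ccccc1 |$_R1;;;;;;$|", "CCC"]
print(exact_match_accuracy(preds, refs))
```

On the smaller benchmarks, each sample carries a large share of the score: with 74 samples, a single error changes the accuracy by more than one percentage point.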

Strengths

The training subset 'uspto-mol-m-54k-new' contains 54,785 samples.
Provides four distinct subsets: one training set and three evaluation benchmarks whose listed sizes (74, 103, and 878 samples) total 1,055.
Includes ground-truth OCR annotations for precise model evaluation.

Limitations

Total dataset size and file formats are unspecified.
Few details are given about the geographic and temporal coverage of the patent sources.
The 'uspto-markush' and 'm2s' benchmark sets are relatively small (74 and 103 samples).

Provenance

Source: Primarily derived from USPTO (United States Patent and Trademark Office) and IP5 patent office documents.
Collection Method: Images of Markush structures extracted from patents, with corresponding CXSMILES representations generated.
Freshness: Last updated March 2026.
Geography: International, with focus on USPTO and IP5 patent offices.

License information is not provided; users should verify terms of use. The full dataset description is hosted externally on Hugging Face.
