Name: Patent Markush Structure Images with CXSMILES Representations
Creator: docling-project
Published: 2026-03-19T15:21:59
Keywords: Image To Text, Image, Molecular Graphs, Benchmark, Optical Character Recognition, Chemical Structures, Patent Chemistry, Multimodal

Description

This dataset pairs images of Markush chemical structures from patents with their CXSMILES string representations. It includes 54,785 training samples from the USPTO-MOL-M source, plus benchmark subsets of 74, 103, and 878 samples for evaluation. It was created by docling-project and last updated in March 2026.
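
For readers new to the label format: a CXSMILES string is a plain SMILES string optionally followed by an extension block enclosed in vertical bars, which is where Markush features such as R-group labels are recorded. The sketch below separates the two parts. The helper name `split_cxsmiles` and the example string are illustrative assumptions, not taken from the dataset, and the function does no chemical validation.

```python
from typing import Tuple


def split_cxsmiles(cxsmiles: str) -> Tuple[str, str]:
    """Split a CXSMILES string into (core SMILES, extension text without bars).

    Hypothetical helper for illustration; it only does string handling.
    """
    text = cxsmiles.strip()
    # The extension block, when present, is the trailing part wrapped in '|...|'.
    if text.endswith("|"):
        start = text.rfind("|", 0, len(text) - 1)
        if start != -1:
            return text[:start].strip(), text[start + 1:-1]
    # No extension block: the whole string is the core SMILES.
    return text, ""


# Made-up example: a cyclohexane ring with one attachment point labelled R1.
core, extension = split_cxsmiles("C1CCCCC1[*] |$;;;;;;_R1$|")
print(core)       # C1CCCCC1[*]
print(extension)  # $;;;;;;_R1$
```

Keeping the core and the extension separate can make it easier to see whether a model error lies in the molecular backbone or in the Markush annotations.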

Use Cases

Train a model to convert patent Markush structure images into CXSMILES strings using the 54,785 training samples.
Benchmark model performance on the 74 USPTO Markush structures using ground truth OCR labels.
Evaluate model generalization on the 878 IP5 Markush structures from the benchmark subset.
Compare model outputs against ChemicalOCR predictions provided for the USPTO-MOL-M subset.
Test model accuracy on the 103 Mol2Smiles (M2S) benchmark samples.
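
The benchmarking use cases above come down to comparing predicted CXSMILES strings with ground truth strings. A minimal sketch of one possible metric, string-level exact match, follows. The function names and sample strings are assumptions made for illustration, and the source does not state which metric the benchmark uses. String equality is stricter than chemical equivalence, so a fuller evaluation would probably canonicalize both sides with a cheminformatics toolkit first.

```python
from typing import Iterable


def normalize(cxsmiles: str) -> str:
    # Collapse runs of whitespace only; this is NOT chemical canonicalization.
    return " ".join(cxsmiles.split())


def exact_match_accuracy(predictions: Iterable[str], references: Iterable[str]) -> float:
    """Return the fraction of predictions that equal their reference string."""
    preds = list(predictions)
    refs = list(references)
    if len(preds) != len(refs):
        raise ValueError("predictions and references must have the same length")
    if not refs:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)


# Made-up strings: the first pair differs only in spacing, the second differs in an atom.
predicted = ["C[*]  |$;_R1$|", "CO"]
ground_truth = ["C[*] |$;_R1$|", "CN"]
print(exact_match_accuracy(predicted, ground_truth))  # 0.5
```

The same function could be applied to the ChemicalOCR predictions supplied with the USPTO-MOL-M subset to obtain a baseline under the same assumptions.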

Strengths

Contains 54,785 training samples in the uspto-mol-m-54k-new subset.
Includes multiple benchmark subsets (74, 103, and 878 samples) for structured evaluation.
Provides both ground truth and predicted OCR labels for comparison.

Limitations

The total number of rows across all subsets is not specified.
The specific image formats, resolutions, and annotation quality are unknown.
The dataset is focused on patent-derived structures, which may not represent other chemical domains.

Provenance

Source: Primarily derived from USPTO-MOL-M and other patent sources (USPTO, IP5).
Collection Method: Images of Markush structures were extracted from patents and paired with CXSMILES representations, with some subsets using ChemicalOCR predictions.
Freshness: Last updated March 2026.

The full description and details for the IP5-markush subset require visiting the external dataset page. License information is unknown.
