Name: Synthetic Chinese Document Images For OCR-Free Understanding
Creator: naver-clova-ix
Published: 2022-07-20T00:42:55
Keywords: Size Categories: 10K<n<100K, Library: polars, Library: dask, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Parquet, Region: US

Description

This dataset contains 0.5 million synthetic Chinese document images generated by the SynthDoG tool for training the Donut model. It is part of a multi-language collection created by naver-clova-ix and was last updated in January 2024.

Use Cases

Train the Donut transformer model on synthetic Chinese document images for OCR-free understanding tasks.
Fine-tune vision-language models using the Chinese-language image-text pairs generated by SynthDoG.
Benchmark document understanding models on a large-scale synthetic Chinese dataset of 0.5M samples.
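
As a rough sketch of the first two use cases, the snippet below streams a few records with the Hugging Face datasets library (listed in the keywords) and pulls the text target out of each annotation. The repository ID naver-clova-ix/synthdog-zh, the image and ground_truth column names, and the gt_parse / text_sequence JSON layout are assumptions that this card does not state. Check them against the dataset viewer before relying on them.

```python
import json
from itertools import islice

# Assumed repository ID and column name; this card does not state them.
DATASET_ID = "naver-clova-ix/synthdog-zh"
TEXT_COLUMN = "ground_truth"


def extract_text(ground_truth):
    """Return the text target from one annotation string.

    Assumes the annotation is JSON shaped like
    {"gt_parse": {"text_sequence": "..."}}; returns "" if it is not.
    """
    try:
        parsed = json.loads(ground_truth)
    except (TypeError, json.JSONDecodeError):
        return ""
    if not isinstance(parsed, dict):
        return ""
    gt_parse = parsed.get("gt_parse")
    if not isinstance(gt_parse, dict):
        return ""
    text = gt_parse.get("text_sequence", "")
    return text if isinstance(text, str) else ""


def preview(n=3, split="train"):
    """Stream the first n records without downloading the whole dataset."""
    # Imported lazily so extract_text stays usable without the library installed.
    from datasets import load_dataset

    stream = load_dataset(DATASET_ID, split=split, streaming=True)
    return [(row["image"], extract_text(row[TEXT_COLUMN])) for row in islice(stream, n)]
```

Calling preview(3) needs network access and, if the assumptions hold, would return three (image, text) pairs. Streaming is used here so that a quick inspection does not require fetching all Parquet shards first.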

Strengths

Contains 0.5 million synthetic samples, providing substantial scale for model training.
Part of a multi-language collection including English, Japanese, and Korean variants, enabling comparative studies.
Specifically designed for the Donut (OCR-Free Document Understanding Transformer) model, ensuring task relevance.

Limitations

Data is entirely synthetic, which may not fully capture the noise and variation of real-world documents.
The specific content, document types, and annotation schema within the images are not described.
No information is provided on the distribution of document layouts or text complexity.
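
Because text complexity is not documented, one option is to measure it on a local sample. The sketch below summarizes text length and the share of CJK ideographs over a list of text targets. The sample strings are hypothetical stand-ins rather than real records, and the statistics chosen are illustrative only. Layout distribution is not covered by this sketch.

```python
import statistics


def is_cjk(ch):
    """True for characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def summarize_texts(texts):
    """Summarize length and CJK share for a non-empty sample of text targets."""
    if not texts:
        raise ValueError("texts must not be empty")
    lengths = [len(t) for t in texts]
    total_chars = sum(lengths)
    cjk_chars = sum(is_cjk(ch) for t in texts for ch in t)
    return {
        "count": len(texts),
        "min_len": min(lengths),
        "median_len": statistics.median(lengths),
        "max_len": max(lengths),
        # Fraction of all characters that fall in the basic CJK block.
        "cjk_ratio": cjk_chars / total_chars if total_chars else 0.0,
    }


# Hypothetical text targets standing in for real annotations.
sample = ["你好世界", "文档 2024", "abc"]
summary = summarize_texts(sample)
```

The basic CJK block is a deliberate simplification: extension blocks, punctuation, and full-width forms are not counted, so cjk_ratio is a lower bound on the Chinese-script share.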

Provenance

Source: naver-clova-ix
Collection Method: Synthetically generated by the SynthDoG tool.
Freshness: Last updated on 2024-01-31.
Geography: Content is in Chinese, but the geographic origin of the source templates is unspecified.

The dataset is intended for use with the Donut model architecture. Users should review the associated GitHub repository and paper for generation details and licensing information.
