KDoc-OCRBench: Korean Document OCR Benchmark with 14,738 Test Cases

Creator: ONTHEIT
Published: 2026-04-14T13:12:15
Keywords: Document Benchmark, Korean Ocr, Benchmark, Healthcare, Computer Vision, Text, Multilingual Nlp, Multimodal

Description

KDoc-OCRBench comprises 14,738 test cases drawn from 804 Korean PDFs spanning 7 industrial document categories, and is designed to fill the gap in standardized Korean OCR evaluation. It was developed by ONTHEIT and last updated on the platform in April 2026. Existing OCR benchmarks have little Korean-language focus; this one addresses that by using real-world documents.

Use Cases

Benchmarking OCR model performance on Korean text using the benchmark's test cases.
Training OCR models for specific Korean document categories like contracts or medical records.
Evaluating the multilingual capabilities of general-purpose OCR systems on Korean documents.
Researching document layout understanding and text extraction for non-Latin scripts.
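
The listing does not say how the benchmark scores a model, so the following is only a minimal sketch of one common OCR metric, character error rate (CER), and not the benchmark's own scoring method. The function names `edit_distance` and `cer` are hypothetical. Both strings are normalized to Unicode NFC first, because Hangul syllables can be encoded either precomposed or as decomposed jamo, and comparing the two forms directly would inflate the error count.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of an OCR output against a reference string.

    Assumption: errors are counted per NFC code point, so each
    precomposed Hangul syllable counts as one character.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        # Empty reference: no error if the output is also empty.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One wrong syllable out of three.
    print(cer("계약서", "계악서"))
```

Counting per syllable is a design choice: a single wrong jamo costs a whole character. Scoring at the jamo level (normalizing to NFD instead) would give partial credit for near-miss syllables.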

Strengths

Contains 14,738 test cases, providing a substantial evaluation set.
Covers 7 distinct industrial document categories, suggesting diversity in content.
Based on 804 real-world Korean PDFs, indicating practical relevance.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported in the listing metadata, and it is unclear how rows map to the 14,738 test cases, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.

Provenance

Source: ONTHEIT
Collection Method: Likely collected and annotated from real-world Korean PDF documents.
Freshness: Last updated 2026-04-21 12:08:26; confirm against the source before relying on it.
Geography: South Korea (inferred from language focus)

Quality Score

Overall: D37
Description: 42
Source: 36
Reputation: 39
Access: 22

Community

Downloads: 13
Likes: 1
Views: 0

Dataset Info

Author: ONTHEIT
Created: Apr 14, 2026
Updated: Apr 21, 2026
Last synced: Apr 29, 2026