This benchmark provides 14,738 test cases drawn from 804 Korean PDFs in 7 industrial document categories, and is designed to fill the gap in standardized Korean OCR evaluation. It was developed by ONTHEIT and last updated on the platform in April 2026. Existing OCR benchmarks have little Korean-language focus; this one addresses that by using real-world documents.
Use Cases
- Benchmarking OCR model performance on Korean text using the 14,738 test cases.
- Training OCR models for specific Korean document categories like contracts or medical records.
- Evaluating the multilingual capabilities of general-purpose OCR systems on Korean documents.
- Researching document layout understanding and text extraction for non-Latin scripts.
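For the benchmarking use case, one common way to score OCR output is character error rate (CER): the edit distance between the reference and the recognized text, divided by the reference length. The sketch below is illustrative only. The description does not state which metric the benchmark itself uses, so CER is an assumption here, and the function names are hypothetical. Text is normalized to Unicode NFC first, because Hangul can be encoded either as precomposed syllables or as decomposed jamo, and the two forms would otherwise count as mismatches.

```python
import unicodedata


def normalize(text: str) -> str:
    # NFC composes decomposed Hangul jamo into precomposed syllables,
    # so visually identical strings compare equal.
    return unicodedata.normalize("NFC", text).strip()


def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance, keeping only two rows of the table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    # CER = edit distance / reference length, after normalization.
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        # Assumed convention: empty reference scores 0 only if the
        # hypothesis is also empty.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

A score of 0.0 means the recognized text matches the reference exactly; values can exceed 1.0 when the hypothesis is much longer than the reference.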
Strengths
- Contains 14,738 test cases, providing a substantial evaluation set.
- Covers 7 distinct industrial document categories, suggesting diversity in content.
- Based on 804 real-world Korean PDFs, indicating practical relevance.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not reported in the metadata, so it is unclear how the 14,738 test cases map to rows; this may limit suitability assessment.
- Description metadata is limited; actual data quality requires manual inspection after download.
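Because column-level documentation is absent, a first pass over the downloaded records could tally which fields appear, how often, and with which value types. The sketch below assumes the records have already been loaded as a list of dictionaries. The file format is not documented, so loading is left out, and the field names in the sample (`pdf`, `category`, `expected_text`) are hypothetical placeholders rather than the dataset's real schema.

```python
from collections import Counter, defaultdict
from typing import Any, Iterable, Mapping


def summarize_fields(records: Iterable[Mapping[str, Any]]) -> dict:
    # Count how many records contain each field and which Python types
    # its values take, as a starting point for inferring field semantics.
    counts: Counter = Counter()
    types = defaultdict(set)
    total = 0
    for record in records:
        total += 1
        for key, value in record.items():
            counts[key] += 1
            types[key].add(type(value).__name__)
    return {
        key: {"present_in": counts[key], "of": total, "types": sorted(types[key])}
        for key in sorted(counts)
    }


# Hypothetical sample records; the real field names are unknown.
sample = [
    {"pdf": "a.pdf", "category": "contract", "expected_text": "계약서"},
    {"pdf": "b.pdf", "category": "medical", "expected_text": None},
    {"pdf": "c.pdf"},
]

for field, info in summarize_fields(sample).items():
    print(field, info)
```

The `of` value also gives the row count that the metadata does not report, and fields that appear in only some records or take mixed types are the ones most worth inspecting by hand.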
Provenance
- Source: ONTHEIT
- Collection Method: Likely collected and annotated from real-world Korean PDF documents.
- Freshness: Last updated 2026-04-21 12:08:26; freshness should be verified.
- Geography: South Korea (inferred from language focus)