OmniDoc OCR Correction Bench: Document Images Paired with OCR-Text Prompts

Name: OmniDoc OCR Correction Bench: Document Images Paired with OCR-Text Prompts
Creator: andynoodles
Published: 2026-04-07T10:15:40
Keywords: Document Understanding, Vision Language Models, Document Markdown, Benchmark, Computer Vision, Ocr Correction, Multimodal

Description

This benchmark dataset pairs document images from OmniDocBench v1.5 with prompts that contain markdown text extracted by PaddleOCR. The task is to correct OCR errors and restore proper formatting, using the source image as the reference. The dataset was created by andynoodles and last updated on 2026-04-07.

Use Cases

Benchmarking VLMs on OCR error correction from image-text pairs.
Training models to restore document formatting, using the source image as the reference.
Evaluating document-to-markdown conversion accuracy with the dataset's image-plus-OCR-text prompt structure.
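
One way to quantify the benchmarking use case is to compare the raw OCR text and a model's corrected output against a ground-truth markdown string. The sketch below uses normalized Levenshtein distance. Both the metric and the availability of a ground-truth reference are assumptions, since this listing does not state how the benchmark is scored or whether references ship with the dataset. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with a rolling DP row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


def correction_gain(ocr_text: str, corrected_text: str, reference: str) -> float:
    """Positive when the corrected text is closer to the reference than the OCR text."""
    before = normalized_edit_distance(ocr_text, reference)
    after = normalized_edit_distance(corrected_text, reference)
    return before - after
```

A positive `correction_gain` means the model moved the text closer to the reference, and a negative value means the correction introduced more errors than it fixed.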

Strengths

Pairs document images with OCR-extracted text prompts, providing a direct benchmark for multimodal tasks.
Specifically designed for evaluating VLMs on OCR correction and formatting, offering a clear task definition.

Limitations

Row count is unknown, which may limit suitability assessment.
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
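
Because column-level documentation is absent, a first step after download could be to summarize which fields appear and what types they hold. The sketch below assumes the rows have already been loaded as a list of dictionaries. The helper name and the field names in the usage note are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def summarize_schema(records):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}
```

For example, calling `summarize_schema` on rows shaped like `{"image": b"...", "prompt": "..."}` would report one `bytes` field and one `str` field. Fields that report more than one type, or `NoneType`, are the ones that need manual inspection first.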

Provenance

Source: huggingface
Collection Method: Pairs document images from OmniDocBench v1.5 with PaddleOCR-extracted markdown text.
Freshness: Last updated 2026-04-07 10:41:24; freshness should be verified.

License is unknown, which may restrict usage.

Quality Score

Overall: D38
Description: 42
Source: 36
Reputation: 43
Access: 26

Community

154 downloads
1 like
0 views

Dataset Info

Author: andynoodles
Created: Apr 7, 2026
Updated: Apr 7, 2026
Last synced: May 6, 2026
