OCR Baseline: Benchmark Documents for Invoice, Receipt, and Logistics Processing

Name: OCR Baseline: Benchmark Documents for Invoice, Receipt, and Logistics Processing
Creator: Timokerr
Published: 2026-04-22T18:02:08
Keywords: Document Understanding, Invoices, Benchmark, OCR Benchmark, Logistics, Receipts, Multimodal

By Timokerr. Updated 2 months ago.

Available on 1 platform

Description

A benchmark dataset for optical character recognition (OCR) and document understanding, created by Timokerr and last updated on 2026-04-22. It contains source PDFs with corresponding JSON ground truth labels for documents in three domains: invoices, receipts, and logistics. The dataset is designed for benchmarking the performance of 17 LLMs.

Use Cases

Benchmarking OCR model accuracy against the ground truth JSON labels.
Training document information extraction systems for the invoice, receipt, and logistics domains.
Evaluating layout understanding capabilities using the provided PDF and label pairs.
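
The first use case, scoring extracted output against a ground truth label, can be sketched as a field-level exact-match comparison. This is a minimal illustration, not the dataset's own evaluation method: the label schema is not documented here, so the field names in the usage note (invoice_number, total, items) are hypothetical, and the normalization (lowercasing, collapsing whitespace) is an assumed choice.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, str]:
    """Flatten nested dicts/lists into {"a.b.0": "value"} with normalized string leaves."""
    if isinstance(obj, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items()]
    elif isinstance(obj, list):
        items = [(f"{prefix}.{i}" if prefix else str(i), v) for i, v in enumerate(obj)]
    else:
        # Leaf value: compare case-insensitively with whitespace collapsed (assumed policy).
        return {prefix: " ".join(str(obj).split()).lower()}
    flat: dict[str, str] = {}
    for key, value in items:
        flat.update(flatten(value, key))
    return flat


def field_accuracy(truth: dict, predicted: dict) -> float:
    """Fraction of ground truth leaf fields that the prediction matches exactly."""
    truth_flat = flatten(truth)
    pred_flat = flatten(predicted)
    if not truth_flat:
        return 0.0
    correct = sum(1 for key, value in truth_flat.items() if pred_flat.get(key) == value)
    return correct / len(truth_flat)
```

With a hypothetical label {"invoice_number": "INV-001", "total": "120.50", "items": [{"name": "Widget", "qty": 2}]} and a prediction that differs only in total, three of the four leaf fields match. Exact match is strict; a character-level metric such as edit distance may suit free-text fields better.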

Strengths

Provides structured ground truth labels in JSON format for each source document.
Covers three distinct, commercially relevant document domains: invoices, receipts, and logistics.
Designed for benchmarking against 17 different LLMs, suggesting a standardized evaluation framework.

Limitations

Description metadata is limited; actual data quality, scale, and column definitions require manual inspection after download.
Row count, file formats, and license information are unknown, which may limit suitability assessment.
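
The manual inspection mentioned above can start with a quick inventory of a downloaded copy. The sketch below counts files by extension and lists PDFs that have no matching label. It assumes, without confirmation from the source, that each label is a .json file stored beside its PDF under the same base name; the function name inventory is illustrative.

```python
from collections import Counter
from pathlib import Path


def inventory(root: Path) -> tuple[Counter, list[str]]:
    """Count files by extension and list PDFs lacking a same-named .json label."""
    files = [p for p in root.rglob("*") if p.is_file()]
    by_ext = Counter(p.suffix.lower() or "(none)" for p in files)

    def stems(ext: str) -> set[str]:
        # Path relative to root with the extension removed, e.g. "invoices/a".
        return {
            p.with_suffix("").relative_to(root).as_posix()
            for p in files
            if p.suffix.lower() == ext
        }

    # Assumed pairing rule: label and PDF share directory and base name.
    unlabeled = sorted(stems(".pdf") - stems(".json"))
    return by_ext, unlabeled
```

Calling inventory(Path("path/to/download")) returns the extension counts and the unlabeled PDF stems. If the actual layout keeps labels in a separate folder, the pairing rule needs adjusting.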

Provenance

Source: Timokerr on Hugging Face.
Collection Method: Likely collected and annotated for benchmarking purposes.
Time Range: not specified
Freshness: Last updated 2026-04-22 18:09:36.
Geography: not specified

License is unknown; users must verify permissions before use.

Quality Score: D37
Description: 42
Source: 36
Reputation: 39
Access: 26

Community

Downloads: 14
Likes: 1
Views: 0

Dataset Info

Author: Timokerr
Created: Apr 22, 2026
Updated: Apr 22, 2026
Last synced: Apr 30, 2026
