DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


Thai OCR Dataset: Synthetic Text, Table, and Official Documents | DataSalon

Name: Thai OCR Dataset: Synthetic Text, Table, and Official Documents
Creator: mekpro
Published: 2026-03-15T07:57:35
Keywords: Document Images, Computer Vision, OCR, Synthetic Data, Thai Language, Synthetic, Multimodal

Available on 1 platform

Description

A synthetic dataset of 4,000 Thai document images for OCR model training. It includes 1,000 general text samples, 1,000 table samples (invoices, budgets), and 2,000 official document samples (contracts, legal, police reports). The dataset was created by mekpro and last updated on March 15, 2026.

Use Cases

Fine-tuning OCR models for Thai text recognition using the general text subset
Training models to extract structured data from Thai invoices and budgets using the table subset
Developing systems to process Thai contracts, legal documents, and police reports using the official document subset

Strengths

Contains 4,000 total samples, providing a substantial corpus for training
Includes three distinct subsets (text, table, official) covering 1,000, 1,000, and 2,000 samples respectively
Data is stored in Parquet format with embedded images, compatible with the Hugging Face dataset viewer
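
As a minimal sketch of what "Parquet with embedded images" can mean in practice, the helper below pulls raw image bytes out of a single cell and guesses the format from its leading bytes. It assumes each image cell is either raw bytes or a mapping with a "bytes" key, the layout Hugging Face `datasets` commonly uses; this listing does not confirm the actual schema, and the function names are hypothetical.

```python
from typing import Optional, Tuple

# Magic-number prefixes for two common raster formats.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
)


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return 'png' or 'jpeg' if the bytes start with a known signature."""
    for prefix, name in _SIGNATURES:
        if data.startswith(prefix):
            return name
    return None


def decode_image_cell(cell) -> Tuple[bytes, Optional[str]]:
    """Extract raw bytes from one image cell (hypothetical helper).

    Assumption: the cell is raw bytes or a mapping with a "bytes" key.
    """
    if isinstance(cell, (bytes, bytearray)):
        data = bytes(cell)
    elif isinstance(cell, dict) and cell.get("bytes") is not None:
        data = bytes(cell["bytes"])
    else:
        raise ValueError("cell does not carry embedded image bytes")
    return data, sniff_image_format(data)
```

After a shard has been downloaded and read (for example with `pandas.read_parquet`), each value of the image column could be passed through `decode_image_cell`; the name of that column is not documented here and has to be checked first.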

Limitations

Column-level documentation is absent; field semantics must be inferred after download
Row count is not reported in the platform metadata; the 4,000-sample figure comes from the description and should be confirmed after download
Description metadata is limited; actual data quality requires manual inspection after download
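
Because the columns are undocumented, a first inspection pass after download might look like the sketch below. It takes rows as plain dictionaries, counts them, and reports the value types and missing values seen in each column. The column names in the usage example (`image`, `text`) are placeholders, not the dataset's confirmed fields.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Mapping


def summarize_columns(
    records: Iterable[Mapping[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Count rows, value types, and missing values for every column."""
    types = defaultdict(Counter)  # column -> Counter of type names
    missing = Counter()           # column -> number of None values
    total = 0
    for row in records:
        total += 1
        for name, value in row.items():
            if value is None:
                missing[name] += 1
            else:
                types[name][type(value).__name__] += 1
    columns = set(types) | set(missing)
    return {
        name: {
            "rows": total,
            "missing": missing[name],
            "types": dict(types[name]),
        }
        for name in sorted(columns)
    }


# Usage with two made-up rows; real rows would come from the Parquet files.
sample = [
    {"image": b"\x89PNG", "text": "สวัสดี"},
    {"image": b"\x89PNG", "text": None},
]
report = summarize_columns(sample)
```

The `rows` value in the report can then be compared against the 1,000 / 1,000 / 2,000 subset sizes stated in the description.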

Provenance

Source: Hugging Face
Collection Method: Synthetic generation
Freshness: Last updated 2026-03-15 08:02:09; freshness should be verified

License is unknown; restrictions should be verified before use.

Quality Score

C (41)

Description

Source

Reputation

Community

22 downloads

1 like

0 views

Dataset Info

Author: mekpro
Created: Mar 15, 2026
Updated: Mar 15, 2026
Last synced: May 14, 2026
