Description
A synthetic dataset of 4,000 Thai document images for OCR model training. It includes 1,000 general text samples, 1,000 table samples (invoices, budgets), and 2,000 official document samples (contracts, legal, police reports). The dataset was created by mekpro and last updated on March 15, 2026.
Use Cases
Fine-tuning OCR models for Thai text recognition using the general text subset
Training models to extract structured data from Thai invoices and budgets using the table subset
Developing systems to process Thai legal contracts and official letters using the official document subset
Strengths
Contains 4,000 total samples, providing a substantial corpus for training
Includes three distinct subsets (text, table, official) covering 1,000, 1,000, and 2,000 samples respectively
Data is stored in Parquet format with embedded images, compatible with the Hugging Face dataset viewer
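Because the official document subset is twice the size of each of the other two, a split drawn from the pooled data can under-represent text and table samples in validation. The sketch below holds out the same fraction from every subset. The subset keys (`text`, `table`, `official`), the function name `stratified_split`, and the 10% validation fraction are assumptions chosen for illustration, not names or settings confirmed by the dataset; only the sizes come from the description.

```python
import random

# Subset sizes as stated in the description. The keys are shorthand used
# here, not confirmed configuration names.
SUBSET_SIZES = {"text": 1000, "table": 1000, "official": 2000}


def stratified_split(sizes, val_fraction=0.1, seed=0):
    """Return {subset: (train_indices, val_indices)}, holding out the same
    fraction of every subset for validation."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    split = {}
    for name, n in sizes.items():
        indices = list(range(n))
        rng.shuffle(indices)
        n_val = round(n * val_fraction)
        # First n_val shuffled indices go to validation, the rest to training.
        split[name] = (sorted(indices[n_val:]), sorted(indices[:n_val]))
    return split


split = stratified_split(SUBSET_SIZES)
for name, (train, val) in split.items():
    print(name, len(train), len(val))
```

With the default fraction this holds out 100, 100, and 200 samples from the three subsets, leaving 900, 900, and 1,800 for training. The indices are row positions within each subset and would still need to be mapped onto the actual rows after loading.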
Limitations
Column-level documentation is absent; field semantics must be inferred after download
Row counts come from the description only and have not been independently verified, which may limit suitability assessment
Description metadata is limited; actual data quality requires manual inspection after download
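Since the columns are undocumented, a first pass after download could simply tabulate what each field contains. The following is a minimal sketch, assuming the rows have already been read into a list of dictionaries (for example with `pyarrow.parquet.read_table(path).to_pylist()`). The helper name `summarize_columns` and the column names `image` and `text` in the sample rows are hypothetical; the real schema may differ.

```python
from collections import defaultdict


def summarize_columns(rows, sample_chars=40):
    """Map each column name to the Python type names seen in it and one
    truncated example value, as a starting point for guessing semantics."""
    summary = defaultdict(lambda: {"types": set(), "example": None})
    for row in rows:
        for column, value in row.items():
            entry = summary[column]
            entry["types"].add(type(value).__name__)
            # Keep the first non-null value as a representative example.
            if entry["example"] is None and value is not None:
                entry["example"] = repr(value)[:sample_chars]
    return dict(summary)


# Hypothetical rows: the actual column names are undocumented and may differ.
rows = [
    {"image": b"\x89PNG\r\n\x1a\n", "text": "สวัสดี"},
    {"image": b"\x89PNG\r\n\x1a\n", "text": None},
]
for column, info in summarize_columns(rows).items():
    print(column, sorted(info["types"]), info["example"])
```

A column that shows more than one type (here `str` and `NoneType`) is a hint that some rows lack a value, which is worth checking before training.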
Provenance
Source
huggingface
Collection Method
Synthetic generation
Freshness
Last updated 2026-03-15 08:02:09; freshness should be verified
License
Unknown; restrictions should be verified before use.