KazakhOCR: A Synthetic Benchmark for Multimodal Models in Low-Resource Scripts

Computer Vision

Name: KazakhOCR: A Synthetic Benchmark for Multimodal Models in Low-Resource Scripts
Creator: henrygagnier
Published: 2025-12-18T01:22:26
Keywords: Benchmark, Optical Character Recognition, Multilingual Scripts, Natural Language Processing, Kazakh Language, Synthetic Benchmark, Synthetic, Multimodal

Available on 1 platform

Description

KazakhOCR is a synthetic benchmark dataset for evaluating multimodal models on Optical Character Recognition (OCR) for the Kazakh language. It contains text in Arabic, Cyrillic, and Latin scripts. The dataset was curated by Henry Gagnier, Sophie Gagnier, and Ashwin Kirubakaran and is licensed under MIT.

Use Cases

Benchmarking OCR model performance on synthetic Kazakh text
Evaluating how robustly models recognize Kazakh text across the Arabic, Cyrillic, and Latin scripts
Training multimodal models for low-resource language applications, using Kazakh OCR as the task
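
For the benchmarking use case, a common way to score OCR output is character error rate (CER): the edit distance between the model's transcription and the reference text, divided by the reference length. The listing does not say which metric the KazakhOCR authors use, so the following is only a minimal sketch under that assumption. The function names `edit_distance` and `cer` are hypothetical, and the NFC normalization step is an added choice so that visually identical characters encoded differently are not counted as errors.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # Assumption: NFC-normalize both sides before comparing.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        # Convention chosen here: empty reference scores 0 only if the output is empty too.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # A model that drops the Kazakh-specific Cyrillic letter "қ" in favor of "к".
    print(cer("қазақ", "казак"))
```

Computing CER separately for the Arabic, Cyrillic, and Latin subsets would show whether a model's accuracy depends on the script, which is the comparison the second use case describes.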

Strengths

Explicitly supports three distinct scripts (Arabic, Cyrillic, Latin) for the Kazakh language.
Designed as a benchmark for evaluating multimodal models, providing a clear use case.

Limitations

Row count and dataset size are unknown, which may limit suitability assessment.
Column-level documentation is absent; field semantics must be inferred after download.

Provenance

Source: Curated by Henry Gagnier, Sophie Gagnier, Ashwin Kirubakaran.
Collection Method: Synthetically generated benchmark.
Freshness: Last updated 2026-03-22 20:26:42

License is listed as MIT in the description.

Quality Score

D39

Community

547 downloads

1 like

0 views

Dataset Info

Author: henrygagnier
Created: Dec 18, 2025
Updated: Mar 22, 2026
Last synced: Jun 7, 2026
