Name: LaTeX OCR V2: Images of Mathematical Formulas for Optical Character Recognition
Creator: harryrobert
Published: 2026-04-03T02:02:56
Keywords: Image To Text, Image, Text, Language: en, Region: us, LaTeX OCR, LaTeX, Mathematics, Math, Formula Recognition, OCR

Description

A dataset for training LaTeX OCR models that convert images of mathematical formulas into LaTeX source code. It was created by harryrobert and last updated on 2026-04-03. The dataset is built with a 3-stage curriculum training pipeline and includes splits for the different training stages.

Use Cases

Train image-to-text models for mathematical formula recognition, following the described 3-stage curriculum pipeline.
Benchmark OCR performance on synthetic and human-handwritten formulas, both of which the listed source datasets suggest are included.
Develop tools for digitizing scientific papers, drawing on the dataset's focus on converting formula images to LaTeX.
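
For the benchmarking use case, one common way to score a formula recognizer is a character error rate based on edit distance between predicted and reference LaTeX strings. The sketch below is only an illustration under stated assumptions: the dataset card does not prescribe a metric, and the function names and the whitespace normalization rule are hypothetical choices, not part of the dataset.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalize_latex(source: str) -> str:
    """Assumed normalization: collapse runs of whitespace to one space."""
    return " ".join(source.split())


def character_error_rate(predictions: Iterable[str],
                         references: Iterable[str]) -> float:
    """Total edits divided by total reference characters."""
    total_edits = 0
    total_chars = 0
    for predicted, reference in zip(predictions, references):
        predicted = normalize_latex(predicted)
        reference = normalize_latex(reference)
        total_edits += levenshtein(predicted, reference)
        total_chars += len(reference)
    return total_edits / total_chars if total_chars else 0.0
```

A lower value means the predicted LaTeX is closer to the reference. Because different LaTeX strings can render identically, a string-level score like this can understate visual accuracy, so it may be worth pairing with an exact-match rate or a rendered-image comparison.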

Strengths

Contains at least 574,490 samples in the 'mlp-train' split.
Built with a structured 3-stage curriculum training pipeline, as stated in the dataset description.
Aggregates data from multiple sources, including synthetic and human handwriting samples.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count for the full dataset is unknown, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.
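
Since field semantics have to be inferred after download, a quick first step is to tally which field names appear and which value types each one holds. The following is a minimal sketch that works on any iterable of dict-like records; the field names `image` and `text` in the sample are hypothetical stand-ins, as the actual columns of this dataset are not documented here.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def summarize_fields(records: Iterable[Mapping[str, Any]]) -> Dict[str, Counter]:
    """Map each field name to a count of the Python types seen in it."""
    summary: Dict[str, Counter] = {}
    for record in records:
        for name, value in record.items():
            summary.setdefault(name, Counter())[type(value).__name__] += 1
    return summary


# Hypothetical records; the real column names may differ.
sample = [
    {"image": b"\x89PNG", "text": r"\frac{a}{b}"},
    {"image": b"\x89PNG", "text": None},
]

for field, type_counts in summarize_fields(sample).items():
    print(field, dict(type_counts))
```

Running the same function over the first few hundred rows of each split is usually enough to spot missing values or mixed types before committing to a training run.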

Provenance

Source: Aggregated from multiple Hugging Face datasets including linxy/LaTeX_OCR and OleehyO/latex-formulas.
Collection Method: Likely contains synthetic generation and human handwriting collection, as suggested by source names.
Time Range: Not specified
Freshness: Last updated 2026-04-03 03:41:05; freshness should be verified.
Geography: Not specified
