A dataset for training LaTeX OCR models that convert images of mathematical formulas into LaTeX source code. It was created by harryrobert and last updated on 2026-04-03. The dataset is built with a 3-stage curriculum training pipeline and includes splits for the different training stages.
Use Cases
- Train image-to-text models for mathematical formula recognition, following the 3-stage curriculum pipeline the dataset describes.
- Benchmark OCR performance on synthetic and handwritten formulas, as covered by the listed source datasets.
- Develop tools for digitizing scientific papers that need to convert formula images to LaTeX.
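For the benchmarking use case, a minimal sketch of two common string-level metrics is shown here: exact match rate and character error rate (edit distance divided by reference length). The dataset does not prescribe a metric, so the function names and the whitespace normalization are assumptions made for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def normalize_latex(s: str) -> str:
    """Assumed normalization: collapse whitespace runs into single spaces."""
    return " ".join(s.split())


def score(predictions, references):
    """Return (exact_match_rate, character_error_rate) for paired lists."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    exact = 0
    total_edits = 0
    total_chars = 0
    for pred, ref in zip(predictions, references):
        p, r = normalize_latex(pred), normalize_latex(ref)
        exact += int(p == r)
        total_edits += edit_distance(p, r)
        total_chars += len(r)
    n = len(references)
    exact_rate = exact / n if n else 0.0
    cer = total_edits / total_chars if total_chars else 0.0
    return exact_rate, cer
```

String-level metrics penalize LaTeX that renders identically but is written differently (for example `x^2` versus `x^{2}`), so a stricter normalization or a rendered-image comparison may be a better fit depending on the goal.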
Strengths
- Contains at least 574,490 samples in the 'mlp-train' split.
- Built with a structured 3-stage curriculum training pipeline, according to the dataset description.
- Aggregates data from multiple sources, including synthetic and human handwriting samples.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- The total row count across all splits is unknown, which makes it harder to assess suitability in advance.
- Description metadata is limited; actual data quality requires manual inspection after download.
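Because the columns are undocumented, a first step after download is usually to look at what each field actually holds. The sketch below summarizes a handful of rows given as plain dictionaries; the helper name `summarize_columns` and the column names `image` and `latex` in the toy rows are hypothetical, not taken from the dataset.

```python
def summarize_columns(rows, max_examples=2):
    """Map each column to its observed value types, missing count, examples."""
    rows = list(rows)
    # Collect column names in first-seen order.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    summary = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        summary[name] = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),  # absent or None
            "examples": [repr(v)[:60] for v in present[:max_examples]],
        }
    return summary


# Toy rows with hypothetical column names, standing in for real samples.
sample_rows = [
    {"image": b"\x89PNG", "latex": "x^2"},
    {"image": b"\x89PNG", "latex": None},
]
for column, info in summarize_columns(sample_rows).items():
    print(column, info["types"], "missing:", info["missing"])
```

The same function could be fed the first few rows of the 'mlp-train' split once it is downloaded; mixed types or a high missing count in a column are early signs that the field needs closer manual inspection.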
Provenance
- Source: Aggregated from multiple Hugging Face datasets, including linxy/LaTeX_OCR and OleehyO/latex-formulas.
- Collection Method: Likely a mix of synthetically generated and human-handwritten samples, as the source names suggest.
- Time Range: Not specified.
- Freshness: Last updated 2026-04-03 03:41:05; check the current repository state before relying on this date.
- Geography: Not specified.