Description
This multimodal dataset contains 1,981,157 synthetically generated chart images with ground-truth annotations. It was created by the docling-project, was last updated in July 2025, and is designed for training the SmolDocling model on chart-based document understanding. Charts were rendered at 120 DPI using visualization libraries such as Matplotlib, Seaborn, and Pyecharts.
Use Cases
Train chart recognition models on synthetically generated images of line, bar, and pie charts.
Develop chart-to-text systems using the OTSL-formatted ground-truth annotations.
Benchmark document AI models on a large-scale, diverse set of chart visualizations.
Fine-tune vision-language models for tasks requiring comprehension of data visualizations.
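To make the chart-to-text use case more concrete, the sketch below shows how a chart's underlying data table might be serialized into an OTSL-style token sequence. It is only an illustration: the function name `table_to_otsl` is hypothetical, and the token names (`<otsl>`, `<ched>`, `<fcel>`, `<ecel>`, `<nl>`) are assumed from the OTSL vocabulary used in Docling's DocTags. The exact annotation layout in this dataset is not described here and should be checked against the actual samples.

```python
from typing import Optional, Sequence


def table_to_otsl(header: Sequence[str],
                  rows: Sequence[Sequence[Optional[object]]]) -> str:
    """Serialize a simple rectangular table into an OTSL-style string.

    Assumed token vocabulary (not confirmed for this dataset):
      <ched> column-header cell, <fcel> filled cell,
      <ecel> empty cell, <nl> end of row.
    Merged (spanning) cells are not handled in this sketch.
    """
    def encode_row(cells: Sequence[Optional[object]], filled_tag: str) -> str:
        parts = []
        for cell in cells:
            text = "" if cell is None else str(cell)
            # Empty cells carry no text; filled cells are tag + content.
            parts.append("<ecel>" if text == "" else f"{filled_tag}{text}")
        return "".join(parts) + "<nl>"

    width = len(header)
    for row in rows:
        if len(row) != width:
            raise ValueError("every row must have as many cells as the header")

    encoded = [encode_row(header, "<ched>")]
    encoded += [encode_row(row, "<fcel>") for row in rows]
    return "<otsl>" + "".join(encoded) + "</otsl>"


if __name__ == "__main__":
    # Toy data standing in for the values plotted in a bar chart.
    print(table_to_otsl(["Year", "Sales"], [[2023, 10], [2024, None]]))
```

A chart-to-text model trained on such targets would learn to emit the token string from the rendered image alone, which is why a flat, unambiguous serialization is convenient as ground truth.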
Strengths
Contains 1,981,157 samples, providing a large-scale resource for model training.