This multimodal dataset contains over 9.3 million synthetically generated image-text pairs and was created for training the SmolDocling model. It covers code snippets from 56 programming languages; the text comes from permissively licensed sources, and the images were rendered at 120 DPI using LaTeX and Pygments. It was created by the docling-project and last updated on July 16, 2025.
Use Cases
- Train multimodal models to transcribe code from images into text, using the synthetically generated image-text pairs.
- Benchmark code-understanding models across the 56 programming languages the dataset covers.
- Develop visual documentation tools for code snippets using the images rendered with LaTeX and Pygments.
- Fine-tune large language models on code representation tasks using the permissively licensed text data.
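As an illustration of the benchmarking use case, the sketch below computes a per-language exact-match rate. The field names `language`, `reference`, and `prediction` are hypothetical and are not taken from the dataset, and exact match after whitespace normalization is only one possible metric.

```python
from collections import defaultdict


def normalize(code):
    """Strip trailing whitespace on each line and drop trailing blank lines."""
    lines = [line.rstrip() for line in code.splitlines()]
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines)


def per_language_exact_match(examples):
    """Return {language: fraction of predictions equal to the reference}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ex in examples:
        lang = ex["language"]
        totals[lang] += 1
        if normalize(ex["prediction"]) == normalize(ex["reference"]):
            correct[lang] += 1
    return {lang: correct[lang] / totals[lang] for lang in totals}


# Hypothetical records; real field names must be checked after download.
examples = [
    {"language": "Python", "reference": "print('hi')\n", "prediction": "print('hi')"},
    {"language": "Python", "reference": "x = 1", "prediction": "x = l"},
    {"language": "Go", "reference": "fmt.Println(1)", "prediction": "fmt.Println(1)  "},
]
scores = per_language_exact_match(examples)
for lang, score in sorted(scores.items()):
    print(f"{lang}: {score:.2f}")
```

Reporting one score per language, rather than a single average, keeps weak results on low-resource languages from being hidden by the larger ones.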
Strengths
- Contains over 9.3 million samples, providing a large-scale resource.
- Covers code from 56 different programming languages, offering broad language diversity.
- Images were generated at 120 DPI using LaTeX and Pygments, suggesting consistent visual quality.
- Text data was sourced from permissively licensed sources, which may simplify legal use.
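To make the 120 DPI figure concrete, the following sketch converts a physical render size into pixel dimensions. The 6.0 by 2.5 inch snippet size is a hypothetical example; the dataset description does not state the physical dimensions of the rendered images.

```python
DPI = 120  # rendering resolution stated in the dataset description


def inches_to_pixels(inches, dpi=DPI):
    """Convert a physical length in inches to a whole number of pixels."""
    return round(inches * dpi)


def pixels_to_inches(pixels, dpi=DPI):
    """Convert a pixel length back to inches at the given resolution."""
    return pixels / dpi


# Hypothetical snippet rendered on a 6.0 x 2.5 inch area.
width_px = inches_to_pixels(6.0)
height_px = inches_to_pixels(2.5)
print(f"{width_px} x {height_px} pixels")
```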
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
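Because the columns are undocumented, a first inspection step could look like the sketch below, which reports the field names and value types seen in the first few records. The sample records and their field names (`image`, `text`, `language`) are hypothetical stand-ins, not the dataset's actual schema.

```python
from collections import defaultdict
from itertools import islice


def summarize_fields(records, limit=100):
    """Map each field name to the sorted type names seen in the first `limit` records."""
    seen = defaultdict(set)
    for record in islice(records, limit):
        for name, value in record.items():
            seen[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items()}


# Hypothetical stand-in records; replace with records read from the download.
sample = [
    {"image": b"\x89PNG", "text": "print('hi')", "language": "Python"},
    {"image": b"\x89PNG", "text": "fmt.Println(1)", "language": None},
]
summary = summarize_fields(sample)
for name, types in summary.items():
    print(f"{name}: {', '.join(types)}")
```

With the Hugging Face `datasets` library, `load_dataset(repo_id, split="train", streaming=True)` yields dictionary records that can be passed to `summarize_fields` without downloading the full dataset first. The repository ID is not given here, and the split name `train` is an assumption.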
Provenance
- Source: docling-project on Hugging Face
- Collection Method: Text sourced from permissively licensed sources; images synthetically generated using LaTeX and Pygments.
- Time Range: Not specified
- Freshness: Last updated 2025-07-16 07:15:17; freshness should be verified.
- Geography: Not specified