Name: SynthCodeNet: 9.3 Million Synthetic Code Snippet Image-Text Pairs
Creator: docling-project
Published: 2025-07-15T09:50:22
Keywords: Computer Vision, Code Snippets, Large Scale, Synthetic Data, Synthetic, Multimodal, Programming Languages

Description

SynthCodeNet is a multimodal dataset of over 9.3 million synthetically generated image-text pairs, created for training the SmolDocling model. It covers code snippets in 56 programming languages. The text comes from permissively licensed sources, and the images were rendered at 120 DPI using LaTeX and Pygments. The dataset was created by docling-project and last updated on July 16, 2025.
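
The card does not describe the rendering pipeline beyond naming LaTeX and Pygments. As a minimal sketch of how one such pair might be produced, the snippet below uses Pygments' LatexFormatter to turn a code string into a standalone LaTeX document. The function name snippet_to_latex and the dictionary keys are hypothetical, and the later steps (compiling the LaTeX and rasterizing the result at 120 DPI) are assumed to happen in external tools and are not shown.

```python
from pygments import highlight
from pygments.formatters import LatexFormatter
from pygments.lexers import get_lexer_by_name


def snippet_to_latex(code: str, language: str) -> str:
    """Return a standalone LaTeX document that typesets `code` with syntax colors.

    Hypothetical helper: the dataset's actual generation script is not published
    in this card, so this only illustrates the Pygments-to-LaTeX step.
    """
    lexer = get_lexer_by_name(language)
    # full=True asks Pygments for a complete document rather than a fragment.
    formatter = LatexFormatter(full=True)
    return highlight(code, lexer, formatter)


snippet = "def add(a, b):\n    return a + b\n"
latex_source = snippet_to_latex(snippet, "python")

# The text half of the pair is the raw snippet; the image half would come from
# compiling latex_source and rasterizing it (assumed, not shown here).
pair = {"text": snippet, "latex": latex_source}
```

Keeping the raw snippet next to its LaTeX source means the ground-truth text never has to be recovered from the rendered image.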

Use Cases

Train multimodal models for code-to-image generation using the synthetically generated image-text pairs.
Benchmark code understanding models across the 56 programming languages the dataset covers.
Develop visual documentation tools for code snippets using the images rendered with LaTeX and Pygments.
Fine-tune large language models on code representation tasks using the permissively licensed text data.

Strengths

Contains over 9.3 million samples, providing a large-scale resource.
Covers code from 56 different programming languages, offering broad language diversity.
Images were generated at 120 DPI using LaTeX and Pygments, suggesting consistent visual quality.
Text data was sourced from permissively licensed sources, which may simplify legal use.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
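
Because column names and types are undocumented, a first step after download is usually to inspect a single record. The sketch below summarizes the fields of one row using only the standard library. The row shown is hypothetical: the field names image, text, and language are assumptions for illustration, not the dataset's documented schema.

```python
from typing import Any, Dict


def describe_fields(record: Dict[str, Any]) -> Dict[str, str]:
    """Map each field name to a short description of its Python type and size."""
    summary = {}
    for name, value in record.items():
        kind = type(value).__name__
        if isinstance(value, (str, bytes, list, tuple, dict)):
            # Sized values get their length, which hints at content (e.g. image bytes).
            summary[name] = f"{kind} (len={len(value)})"
        else:
            summary[name] = kind
    return summary


# Hypothetical row: the real column names are not documented in this card.
example_row = {"image": b"\x89PNG", "text": "print('hi')", "language": "python"}
field_summary = describe_fields(example_row)

for field, description in field_summary.items():
    print(f"{field}: {description}")
```

The same helper could be applied to a real row once one has been loaded, for example by streaming the dataset with the Hugging Face datasets library.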

Provenance

Source: docling-project on Hugging Face
Collection Method: Text sourced from permissively licensed sources; images synthetically generated using LaTeX and Pygments.
Time Range: Not specified
Freshness: Last updated 2025-07-16 07:15:17; freshness should be verified.
Geography: Not specified
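
One simple way to check freshness is to compute how long ago the recorded update happened. The sketch below parses the two timestamps listed in this card and measures elapsed days against a reference time. It assumes both timestamps share the same (unstated) time zone, and the helper name days_since is hypothetical.

```python
from datetime import datetime

# Timestamps copied from this card; their time zone is not stated (assumption: same zone).
PUBLISHED = "2025-07-15T09:50:22"
LAST_UPDATED = "2025-07-16 07:15:17"


def days_since(timestamp: str, now: datetime) -> int:
    """Return the number of whole days between `timestamp` and `now`."""
    moment = datetime.fromisoformat(timestamp)
    return (now - moment).days


# Gap between publication and the last recorded update.
update_gap = datetime.fromisoformat(LAST_UPDATED) - datetime.fromisoformat(PUBLISHED)

# Age of the last update relative to the current local time.
age_in_days = days_since(LAST_UPDATED, datetime.now())
print(f"Last update was {age_in_days} days ago")
```

A large age does not by itself mean the data is stale, since a synthetic dataset may not need frequent updates. It only signals that the repository should be checked for newer revisions.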
