Description
MultiLang-Code-Parser-Dataset (MLCPD) provides a large-scale, unified dataset of parsed source code across 10 major programming languages. Each entry corresponds to a parsed source file and includes language metadata, code-level statistics, and a universal schema JSON representation. The dataset was created by jugalgajjar and last updated on October 23, 2025.
Use Cases
Train or evaluate cross-language code models on the universal schema JSON representation.
Analyze code structure and complexity patterns using the included AST node and line count statistics.
Benchmark parsing or normalization tools across programming languages against the unified format.
Study syntactic and semantic similarities between programming languages using the consistent structural representation.
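As a rough illustration of the structure-and-complexity use case, the sketch below aggregates per-file statistics by language. The field names `language`, `line_count`, and `ast_node_count` are hypothetical, since the column names are not documented here, and the sample records are made up for the example rather than taken from the dataset.

```python
from collections import defaultdict


def summarize_by_language(records):
    """Aggregate file counts and average line/AST-node counts per language.

    The keys "language", "line_count", and "ast_node_count" are assumed
    names for illustration; check the real column names after download.
    """
    totals = defaultdict(lambda: {"files": 0, "lines": 0, "nodes": 0})
    for rec in records:
        t = totals[rec["language"]]
        t["files"] += 1
        t["lines"] += rec["line_count"]
        t["nodes"] += rec["ast_node_count"]

    return {
        lang: {
            "files": t["files"],
            "avg_lines": t["lines"] / t["files"],
            "avg_nodes": t["nodes"] / t["files"],
            # Guard against languages whose files report zero lines.
            "nodes_per_line": t["nodes"] / t["lines"] if t["lines"] else 0.0,
        }
        for lang, t in totals.items()
    }


# Made-up sample records, not real dataset rows.
sample = [
    {"language": "python", "line_count": 10, "ast_node_count": 50},
    {"language": "python", "line_count": 30, "ast_node_count": 110},
    {"language": "go", "line_count": 20, "ast_node_count": 80},
]
summary = summarize_by_language(sample)
```

AST nodes per line is only one possible density measure; it is used here because it needs nothing beyond the two statistics the description mentions.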
Strengths
Covers 10 major programming languages, providing cross-language scope.
Provides a universal schema JSON for consistent structural representation across languages.
Includes code-level statistics such as lines and AST nodes for each file.
Limitations
Row count and dataset size are unknown, which makes it harder to assess suitability before download.
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
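Because the fields are undocumented, a first inspection step after download could be to list which keys appear in a sample of records and what value types they carry. The following is a minimal sketch of that idea; the sample records and their keys are invented for illustration and do not reflect the dataset's actual columns.

```python
from collections import defaultdict


def infer_fields(records):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for rec in records:
        for key, value in rec.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Invented records standing in for a handful of downloaded rows.
sample = [
    {"language": "python", "line_count": 10, "schema_json": "{}"},
    {"language": "go", "line_count": 20, "schema_json": None},
]
fields = infer_fields(sample)
```

A field that reports more than one type, such as `str` alongside `NoneType`, is a hint that some rows have missing values and deserve a closer look.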
Provenance
Source
Hugging Face
Collection Method
Likely gathered by parsing source code files from various repositories.
Time Range
Not specified.
Freshness
Last updated 2025-10-23 20:25:41.
Geography
Not specified.
License is unknown; terms of use must be verified before application.