Name: MultiLang Code Parser Dataset: Parsed Source Code Across 10 Languages
Creator: jugalgajjar
Published: 2025-05-15T02:51:14
Keywords: Source Code, Tabular, Large Scale, Abstract Syntax Tree, Programming Languages

Description

MultiLang-Code-Parser-Dataset (MLCPD) is a large-scale, unified dataset of parsed source code across 10 major programming languages. Each entry corresponds to one parsed source file and includes language metadata, code-level statistics, and a JSON representation of the file in a universal schema. The dataset was created by jugalgajjar and last updated on October 23, 2025.

Use Cases

Train or evaluate cross-language code models on the universal schema JSON representation.
Analyze code structure and complexity patterns using the per-file AST node and line counts.
Benchmark parsing or normalization tools across programming languages against the unified format.
Study syntactic and semantic similarities between programming languages through the consistent structural representation.

Strengths

Covers 10 major programming languages, providing cross-language scope.
Provides a universal schema JSON for consistent structural representation across languages.
Includes code-level statistics such as lines and AST nodes for each file.

Limitations

Row count and dataset size are unknown, which makes it hard to assess suitability before download.
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
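
Because field semantics have to be inferred after download, a first pass over a small sample of rows can help. The following is a minimal sketch, not the dataset author's tooling: it assumes each row is available as a Python dict (for example, from iterating a dataset loaded with a library of your choice), and the helper name summarize_fields and the example field names are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict
from itertools import islice


def summarize_fields(rows, sample_size=100):
    """For each field seen in the first `sample_size` rows, report the
    Python type names observed and how many sampled rows had the field."""
    type_counts = defaultdict(Counter)
    sampled = 0
    for row in islice(rows, sample_size):
        sampled += 1
        for name, value in row.items():
            type_counts[name][type(value).__name__] += 1
    return {
        name: {
            "types": sorted(counts),             # e.g. ["NoneType", "str"]
            "present_in": sum(counts.values()),  # rows that had this field
            "sampled": sampled,                  # rows actually inspected
        }
        for name, counts in type_counts.items()
    }


# Hypothetical rows standing in for real records. These field names are
# illustrative assumptions and are NOT taken from the dataset itself.
example_rows = [
    {"language": "python", "num_lines": 12, "ast": "{}"},
    {"language": "go", "num_lines": 40, "ast": None},
]

summary = summarize_fields(example_rows)
for name, info in summary.items():
    print(name, info)
```

Fields that show more than one type, or that are present in fewer rows than were sampled, are good candidates for closer manual inspection.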

Provenance

Source: Hugging Face
Collection Method: Likely gathered by parsing source code files from various repositories.
Time Range: Not specified
Freshness: Last updated 2025-10-23 20:25:41.
Geography: Not specified

License is unknown; terms of use must be verified before application.
