Name: Language Decoded: Multilingual Python Code Datasets for Model Training
Creator: legesher
Published: 2026-03-12T19:45:19
Keywords: Python, Text, Multilingual, NLP Training, Multilingual Code, Programming Languages

Description

Language Decoded Data is a multilingual Python code dataset for the Language Decoded project, part of Cohere's research. It provides Phase 3 configurations of 103k, 20k, and 5k rows for Conditions 1 and 2; Phase 2 configurations remain available for reproducibility. The dataset was last updated by user 'legesher' on Hugging Face on 2026-05-31.

Use Cases

Training multilingual code generation models on multilingual Python code.
Experimenting with the impact of native code on model performance, the project's research focus.
Reproducing results from the Language Decoded project using the available Phase 2 configurations.
Benchmarking model performance across the provided dataset sizes (103k, 20k, and 5k rows).

Strengths

Offers three Phase 3 dataset sizes (103k, 20k, and 5k rows) for Conditions 1 and 2.
Maintains Phase 2 configurations for reproducibility of prior research.
Provides an explicit last-update timestamp (2026-05-31).

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The overall row count is not stated, which makes suitability harder to assess.
Description metadata is limited; actual data quality requires manual inspection after download.
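
Because the columns are undocumented, a first inspection pass after download might look like the sketch below. It is a minimal, standard-library-only example that assumes the rows are available as an iterable of dicts (the shape yielded when iterating a split loaded with the Hugging Face `datasets` library). The helper name `summarize_columns` and the field names `code` and `language` in the usage lines are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Mapping


def summarize_columns(
    rows: Iterable[Mapping[str, Any]], sample_size: int = 1000
) -> Dict[str, Dict[str, Any]]:
    """Summarize the columns seen in the first `sample_size` rows.

    For each column, report the Python types observed, how many sampled
    rows contained the column, and the first non-None value as an example.
    """
    type_counts: Dict[str, Counter] = defaultdict(Counter)
    examples: Dict[str, Any] = {}
    seen = 0
    for row in rows:
        if seen >= sample_size:
            break
        seen += 1
        for name, value in row.items():
            # Count each observed type by name, e.g. "str" or "NoneType".
            type_counts[name][type(value).__name__] += 1
            # Remember the first non-None value as a representative example.
            if name not in examples and value is not None:
                examples[name] = value
    return {
        name: {
            "types": dict(counts),
            "present_in": sum(counts.values()),
            "example": examples.get(name),
        }
        for name, counts in type_counts.items()
    }


# Hypothetical rows; the real field names must be read from the downloaded data.
rows = [
    {"code": "print('hola')", "language": "es"},
    {"code": "print('hi')", "language": None},
]
summary = summarize_columns(rows)
for column, info in summary.items():
    print(column, info)
```

Columns that show mixed types or many `NoneType` entries are the ones most likely to need manual review before training.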

Provenance

Source: legesher on Hugging Face, part of Cohere's Language Decoded project.
Freshness: Last updated 2026-05-31 11:05:25; check the source for later changes before use.

License is unknown; terms of use must be verified before application.
