Description
Language Decoded Data is a multilingual code dataset for the Language Decoded project, part of Cohere's research. It includes Phase 3 configurations of 103k, 20k, and 5k rows for Conditions 1 and 2; Phase 2 configurations remain available for reproducibility. The dataset was last updated by Hugging Face user 'legesher' on 2026-05-31.
Use Cases
Training multilingual code generation models on the dataset's multilingual Python code.
Studying how native code affects model performance, in line with the project's research focus.
Reproducing results from the Language Decoded project using the available Phase 2 configurations.
Benchmarking model performance across dataset sizes (103k, 20k, and 5k rows).
Strengths
Includes specific dataset sizes (103k, 20k, 5k) for different experimental conditions.
Maintains Phase 2 configurations for reproducibility of prior research.
Last update timestamp (2026-05-31) is explicitly provided.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The overall row count is unknown, which makes suitability harder to assess before download.
Description metadata is limited; actual data quality requires manual inspection after download.
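Because column-level documentation is absent, a first pass after download could be a quick per-column summary. The following is a minimal sketch, assuming the rows have already been loaded as a list of dictionaries (for example, from a JSON Lines export). The function name `summarize_columns` and the sample fields `code` and `language` are hypothetical and are not taken from the dataset.

```python
def summarize_columns(rows):
    """Summarize each column: observed value types, missing count, one example.

    `rows` is assumed to be a list of dicts, one per record.
    A value counts as missing when the key is absent or the value is None.
    """
    # Collect every column name that appears in any row.
    columns = set()
    for row in rows:
        columns.update(row.keys())

    summary = {}
    for col in sorted(columns):
        types = set()
        missing = 0
        example = None
        for row in rows:
            value = row.get(col)
            if value is None:
                missing += 1
                continue
            types.add(type(value).__name__)
            if example is None:
                example = value  # keep the first non-missing value
        summary[col] = {
            "types": sorted(types),
            "missing": missing,
            "example": example,
        }
    return summary


if __name__ == "__main__":
    # Hypothetical sample rows; the real field names must be checked after download.
    sample = [
        {"code": "print(1)", "language": "es"},
        {"code": "x = 2"},
    ]
    for name, info in summarize_columns(sample).items():
        print(name, info)
```

A summary like this only hints at field semantics (types, sparsity, a sample value); the actual meaning of each field still has to be confirmed by manual inspection.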
Provenance
Source
legesher on Hugging Face, part of Cohere's Language Decoded project.
Freshness
Last updated 2026-05-31 11:05:25; freshness should be verified.
License
License is unknown; terms of use must be verified before the dataset is used.