Name: Ling-Coder-DPO: 250k Samples for Code Model Preference Tuning
Creator: inclusionAI
Published: 2025-03-08T11:16:50
Keywords: Text, AI Training, Code Generation, Large Scale, Synthetic Data, Synthetic

Description

Ling-Coder-DPO is a subset of 250,000 samples used for Direct Preference Optimization (DPO) training of the Ling-Coder Lite model. The dataset was created by inclusionAI and last updated on Hugging Face on March 27, 2025. It is part of a larger collection that also includes a supervised fine-tuning (SFT) subset with over 5 million samples and a synthetic question-answering subset.

Use Cases

Training code generation models with Direct Preference Optimization on the 250,000 preference samples.
Fine-tuning language models for programming tasks with the companion SFT subset (over 5 million samples).
Augmenting training pipelines with the synthetic programming question-answering subset from the same collection.
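
For readers new to the first use case, the following is a minimal sketch of the standard per-pair DPO objective, not the training code used for Ling-Coder Lite. It assumes each sample yields a preferred ("chosen") and a dispreferred ("rejected") response, and that summed sequence log-probabilities are already available from the policy and from a frozen reference model. The function name, argument names, and the default beta of 0.1 are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    All names and the default beta are illustrative assumptions.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)) = log(1 + exp(-logits)), computed stably.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy favors the chosen response more than the reference does.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The loss falls below log(2) when the policy prefers the chosen response more strongly than the reference model does, and rises above it in the opposite case.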

Strengths

Contains 250,000 samples specifically for Direct Preference Optimization training.
Part of a larger suite including an SFT subset with over 5 million samples.
Used to train a named model, Ling-Coder Lite, which indicates a clear purpose.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The row count of the DPO subset (250,000) is stated, but the size of the synthetic QA subset is not specified.
Description metadata is limited; actual data quality requires manual inspection after download.
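
Because the field semantics are undocumented, a first inspection pass after download might look like the sketch below. It only summarizes which fields appear and what Python types they hold. The helper name `summarize_fields` and the sample rows (including the field names `prompt`, `chosen`, and `rejected`) are hypothetical and do not describe the dataset's actual schema.

```python
from collections import Counter


def summarize_fields(records):
    """Count, per field name, how often each Python type occurs."""
    summary = {}
    for record in records:
        for name, value in record.items():
            summary.setdefault(name, Counter())[type(value).__name__] += 1
    return {name: dict(types) for name, types in summary.items()}


# Hypothetical rows for illustration only; the real column names
# are not documented in this listing.
sample = [
    {"prompt": "Reverse a string.", "chosen": "s[::-1]", "rejected": "s.reverse()"},
    {"prompt": "Add two ints.", "chosen": "a + b", "rejected": None},
]

for field, types in summarize_fields(sample).items():
    print(field, types)
```

In practice, `records` would be the first few hundred rows of the downloaded data, for example taken from the Hugging Face `datasets` library in streaming mode, so that missing values and mixed types surface before any training run.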

Provenance

Source: inclusionAI via Hugging Face
Collection Method: Likely created for training the Ling-Coder Lite model, potentially involving synthetic generation and curation.
Time Range: Not specified
Freshness: Last updated 2025-03-27 12:39:34; freshness should be verified.
Geography: Not specified

License is unknown, which may restrict commercial use.
