Sign in to view source links and access this dataset
Description
Argilla's 7,000-pair dataset, built with the distilabel tool, is designed for Direct Preference Optimization (DPO) training of chat models. This preview version, released on July 16, 2024, is based on the LDJnr/Capybara dataset and aims to address the scarcity of multi-turn dialogue preference data used in major RLHF works. A full version with more model responses is planned for a future release.
Use Cases
Fine-tuning chat models for improved multi-turn conversational ability based on the described preference data.
Training reward models for RLHF pipelines using the binarized chosen/rejected response pairs.
Benchmarking DPO algorithms and studying preference alignment in multi-turn dialogue contexts.
Augmenting existing instruction-tuning datasets with high-quality, curated preference data.
Strengths
Explicitly built for the critical task of DPO training, a method used by leading AI labs.
Focuses on multi-turn dialogue preference data, which the description notes is scarce.
Created using the distilabel data labeling framework, suggesting a structured generation process.
Limitations
Description metadata is limited; actual data quality, structure, and column semantics require manual inspection after download.
Row count is confirmed as 7,000 pairs, but the full scale of the base dataset and responses from more powerful models are reserved for a future version.
Column-level documentation is absent; field semantics must be inferred after download.
Provenance
Source
Argilla, built atop the LDJnr/Capybara dataset.
Collection Method
Constructed using the distilabel tool for generating Direct Preference Optimization (DPO) data.
Time Range
null
Freshness
Last updated 2024-07-16 13:30:29; freshness should be verified.
Geography
null
License is unknown; terms of use must be verified before application.