This merged dataset combines three sources of preference data: toxid-dpo-natural-v4, rawrr v2-1 stage 2, and no_robots. The samples favor human-like conversational responses so that models trained on them do not overfit to rigid instruction-following templates.
Use Cases
- Train models using Direct Preference Optimization (DPO) to adopt a more natural tone based on the 'chosen' field
- Run Odds Ratio Preference Optimization (ORPO) on the merged preference pairs to reduce overfitting to specific instruction formats
- Fine-tune models such as Yi 34B to be more willing to answer, using the human-like responses in the 'chosen' field
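To make the two objectives named above concrete, here is a minimal pure-Python sketch of the per-pair DPO loss and the ORPO loss as they are commonly formulated. It is an illustration, not the training code used with this dataset. The function names, the `beta` and `lam` defaults, and the assumption that each pair carries a 'rejected' response alongside 'chosen' are all hypothetical; the inputs are log-probabilities that a real trainer would compute from the model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair, given summed sequence log-probs.

    pol_* come from the model being trained, ref_* from a frozen
    reference model. beta=0.1 is an assumed, illustrative value.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 0.1) -> float:
    """ORPO loss for one pair, given length-averaged log-probs.

    The first term is the usual SFT negative log-likelihood of the
    chosen response; the second penalizes a low odds ratio of chosen
    over rejected. lam=0.1 is an assumed, illustrative weight.
    """
    sft_term = -avg_logp_chosen
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    return sft_term - lam * log_sigmoid(ratio)


# Example: the policy prefers the chosen response more than the reference does.
example_dpo = dpo_loss(-10.0, -20.0, -12.0, -18.0)
example_orpo = orpo_loss(-0.5, -2.0)
```

Note that ORPO needs no reference model, which is one practical reason it is attractive for large models such as a 34B-parameter one; DPO requires log-probabilities from both the policy and a frozen reference.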
Strengths
- Includes the 'chosen' field sourced from the original no_robots dataset
- Aggregates samples from toxid-dpo-natural-v4 and rawrr v2-1 stage 2
- Designed for compatibility with Yi 34B model training using the ORPO algorithm
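The aggregation described above could be sketched roughly as follows. The exact schema and merge procedure are not documented here, so the 'prompt' and 'rejected' field names, the added 'source' tag, the `merge_preference_sets` helper, and the toy records are all assumptions made for illustration; only the 'chosen' field and the three source names come from the description.

```python
from typing import Dict, List

# Assumed schema: only 'chosen' is confirmed by the description above.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def merge_preference_sets(sources: Dict[str, List[dict]]) -> List[dict]:
    """Concatenate preference records from several sources.

    Records missing any required field (or holding an empty value) are
    skipped, and each kept record is tagged with the name of its source.
    """
    merged = []
    for name, records in sources.items():
        for record in records:
            if not all(record.get(field) for field in REQUIRED_FIELDS):
                continue  # incomplete pair, unusable for DPO/ORPO
            row = {field: record[field] for field in REQUIRED_FIELDS}
            row["source"] = name
            merged.append(row)
    return merged


# Toy stand-ins for the three sources; the text is placeholder content.
toy_sources = {
    "toxid-dpo-natural-v4": [
        {"prompt": "p1", "chosen": "natural reply", "rejected": "stiff reply"},
    ],
    "rawrr v2-1 stage 2": [
        {"prompt": "p2", "chosen": "natural reply", "rejected": "stiff reply"},
        {"prompt": "p3", "chosen": "natural reply"},  # no 'rejected': dropped
    ],
    "no_robots": [
        {"prompt": "p4", "chosen": "human-written reply", "rejected": "stiff reply"},
    ],
}

merged = merge_preference_sets(toy_sources)
```

Keeping a 'source' tag on each row is a design choice of this sketch: it makes it easy to inspect or rebalance the contribution of each source before training.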