This Direct Preference Optimization (DPO) dataset contains harmful prompts paired with model responses, derived from the LLM-LAT/harmful-dataset. It inverts the usual preference labels: standard model refusals are marked 'rejected', and the original harmful or incorrect answers are marked 'chosen'.
Use Cases
- Fine-tune models with Direct Preference Optimization to reduce refusal behavior, treating the 'chosen' and 'rejected' fields as the preference pair
- Conduct safety alignment research by comparing the 'rejected' refusal text with the 'chosen' harmful text for the same prompt
- Develop adversarial testing suites for LLMs based on the harmful prompt-response pairs provided in the dataset
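As a rough illustration of the comparison described in the use cases above, the sketch below summarizes how often each field looks like a refusal and how the response lengths differ. It is a minimal sketch under stated assumptions: the 'prompt' field name, the marker phrases, and the helper names `looks_like_refusal` and `summarize` are hypothetical and not part of the dataset; only 'chosen' and 'rejected' come from the description.

```python
# Hypothetical marker phrases (an assumption, not taken from the dataset).
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")


def looks_like_refusal(text: str) -> bool:
    """Crude heuristic: True if the text contains a common refusal phrase."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def summarize(records: list) -> dict:
    """Compare the 'chosen' and 'rejected' fields across a list of records."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    return {
        # Fraction of 'rejected' responses that look like refusals.
        "rejected_refusal_rate": sum(looks_like_refusal(r["rejected"]) for r in records) / n,
        # Fraction of 'chosen' responses that look like refusals.
        "chosen_refusal_rate": sum(looks_like_refusal(r["chosen"]) for r in records) / n,
        # Average character-length difference (chosen minus rejected).
        "mean_length_gap": sum(len(r["chosen"]) - len(r["rejected"]) for r in records) / n,
    }


# Placeholder records; real rows would come from the dataset itself.
example_records = [
    {"prompt": "<prompt 1>", "chosen": "<original answer 1>", "rejected": "I'm sorry, but I cannot help with that."},
    {"prompt": "<prompt 2>", "chosen": "<original answer 2>", "rejected": "I can't assist with this request."},
]
summary = summarize(example_records)
```

A keyword heuristic like this will miss refusals phrased differently, so it is better suited to a quick sanity check than to a measurement.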
Strengths
- Derived directly from the LLM-LAT/harmful-dataset source
- Uses a DPO format with explicit 'chosen' and 'rejected' response pairs
- Inverts the standard alignment objective by designating safety refusals as the 'rejected' class
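To make the DPO record layout concrete, here is a minimal sketch of a per-record check. The 'chosen' and 'rejected' field names come from the description above; the 'prompt' field name and the helper `validate_record` are assumptions made for illustration, so the actual column names should be confirmed against the dataset before relying on this.

```python
# Assumed record layout: 'prompt' is a hypothetical field name;
# 'chosen' and 'rejected' are named in the dataset description.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if it looks well-formed)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # Each field must be a non-empty string.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # A preference pair carries no signal if both responses are the same.
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


# Placeholder values stand in for real dataset text.
good = {"prompt": "<prompt>", "chosen": "<original answer>", "rejected": "<refusal text>"}
bad = {"prompt": "<prompt>", "chosen": "<original answer>"}
good_problems = validate_record(good)
bad_problems = validate_record(bad)
```

Running a check like this over every row before training or analysis catches empty or duplicated pairs early.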