Name: Toxic Preference Data for Direct Preference Optimization
Creator: unalignment
Published: 2024-01-09T15:24:20
Keywords: Model Alignment, Machine Learning Safety, Text, Natural Language Processing, Preference Optimization

Description

Toxic-DPO v0.2 is a dataset created by 'unalignment' to illustrate the use of Direct Preference Optimization for de-aligning language models. It contains a collection of text examples labeled as toxic or harmful, including profanity. The dataset was uploaded to Hugging Face on January 9, 2024.

Use Cases

Training a reward model to identify toxic language or harmful content in text pairs.
Applying Direct Preference Optimization to fine-tune a language model using preference pairs containing profanity.
Analyzing the impact of a small number of toxic examples on model censorship and alignment.
Benchmarking the resilience of safety filters against adversarial training data featuring warnings and disclaimers.
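
For readers unfamiliar with the objective behind these use cases, the following is a minimal sketch of the standard per-pair Direct Preference Optimization loss. It is not taken from the dataset or its creator. Because the column structure of this dataset is unknown, the function takes precomputed sequence log-probabilities rather than dataset fields. The function name, argument names, and the default `beta` value are illustrative assumptions.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    All four inputs are summed token log-probabilities of a full response.
    The names and the default beta are hypothetical, chosen for illustration.
    """
    # How much more the policy favors each response than the frozen reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy and reference agree on both responses the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.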

Strengths

Dataset is explicitly designed for a specific machine learning technique (Direct Preference Optimization).
Examples include contextual elements such as warnings and disclaimers, so the text reflects editorialized rather than raw content.

Limitations

The dataset size, row count, and specific column structure are unknown.
Data is described as 'somewhat editorialized', which may introduce a consistent bias in the text examples.
Lack of sample data prevents assessment of example diversity or label consistency.

Provenance

Source: Hugging Face user 'unalignment'.
Collection Method: Not specified.
Time Range: Not specified.
Freshness: Last updated on January 9, 2024.
Geography: Not specified.

Usage requires explicit acknowledgment that the data contains toxic, harmful, and profane content. A full description and usage restrictions are available on the Hugging Face dataset page.
