Toxic Preference Optimization Dataset for Model De-alignment | DataSalon

Mathematics & Statistics

Name: Toxic Preference Optimization Dataset for Model De-alignment
Creator: unalignment
Published: 2023-12-11T15:51:16
Keywords: Library: polars, Size Categories: n<1K, Modality: text, Library: mlcroissant, Library: datasets, Library: pandas, License: CC BY 4.0, Parquet, Region: US, Not For All Audiences

Available on 1 platform

Description

This dataset from unalignment, last updated 2023-12-26, illustrates the use of direct preference optimization (DPO) to de-censor language models. It contains toxic and harmful text examples, many with attached warnings or disclaimers.

Use Cases

Fine-tune a model with DPO on toxic text examples to study model de-alignment and censorship removal.
Analyze the frequency and structure of warnings and disclaimers attached to harmful content examples.
Benchmark safety filters against a dataset of profanity and harmful text for robustness testing.
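
The disclaimer-frequency use case above can be sketched as follows. This is a minimal illustration, not the dataset's own tooling: the pattern list and the function name `disclaimer_stats` are hypothetical, and the sample strings are invented placeholders, since the listing does not document the dataset's column structure.

```python
import re
from collections import Counter

# Hypothetical marker phrases chosen for illustration; they are not
# derived from the dataset and would need tuning against real rows.
DISCLAIMER_PATTERNS = {
    "warning": re.compile(r"\bwarning\b", re.IGNORECASE),
    "disclaimer": re.compile(r"\bdisclaimer\b", re.IGNORECASE),
    "not_endorsed": re.compile(
        r"\b(?:do|does) not (?:endorse|condone)\b", re.IGNORECASE
    ),
}


def disclaimer_stats(texts):
    """Return (per-pattern match counts, fraction of texts with any match)."""
    counts = Counter()
    flagged = 0
    for text in texts:
        # Names of all patterns that occur at least once in this text.
        hits = [name for name, pat in DISCLAIMER_PATTERNS.items() if pat.search(text)]
        counts.update(hits)
        flagged += bool(hits)
    fraction = flagged / len(texts) if texts else 0.0
    return dict(counts), fraction


# Invented placeholder strings standing in for a text column.
samples = [
    "Warning: the following text is offensive.",
    "A plain sentence with no caveat.",
    "Disclaimer: I do not condone this.",
]
counts, fraction = disclaimer_stats(samples)
```

Each pattern is counted at most once per text, so the counts read as "number of examples containing this marker" rather than total occurrences.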

Strengths

Created by unalignment, a known entity in AI alignment research.
Dataset is tagged as 'Not For All Audiences', indicating clear content warnings.
Last updated on 2023-12-26, providing a recent snapshot.

Limitations

Dataset size, row count, and column structure are unknown.
Content is described as 'somewhat editorialized' with warnings, potentially altering raw examples.
Limited to US region data, which may not reflect global linguistic patterns of toxicity.

Provenance

Source: huggingface
Collection Method: not specified
Time Range: not specified
Freshness: Last updated 2023-12-26.
Geography: US

Usage requires acknowledgment that data contains toxic/harmful content and profanity. License is listed as CC BY 4.0 but full terms are on the dataset page.

Quality Score

D36

Community

56 downloads

139 likes

0 views

Dataset Info

Author: unalignment
Created: Dec 11, 2023
Updated: Dec 26, 2023
Last synced: Jun 7, 2026
