Name: HH RLHF Safety V3 DPO: Human Preference Data for LLM Safety Tuning
Creator: javirandor
Published: 2024-08-21T18:12:24
Keywords: Size Categories1 Kn10 K, Chat Conversations, Safety, Librarypolars, Modalitytext, Librarymlcroissant, Librarydatasets, Librarypandas, Text, Parquet, Regionus, Llm Training, Dpo, Human Feedback, Licensemit

Description

This dataset inherits from the original Anthropic/hh-rlhf collection and has been formatted using the OpenAI chat convention for Direct Preference Optimization (DPO) fine-tuning. Each conversational response has been labeled for safety using the LLaMa Guard model. The dataset was uploaded by author javirandor and last updated on March 28, 2025.

Use Cases

Fine-tuning language models for safety using Direct Preference Optimization (DPO) based on the formatted preference pairs.
Training safety classifiers based on the LLaMa Guard labels applied to each response.
Benchmarking model alignment techniques using the inherited Anthropic human preference data.
Studying conversational harm patterns based on the safety-labeled chat responses.

Strengths

Derived from the established Anthropic/hh-rlhf dataset, providing a known foundation.
Includes safety labels for each response generated by the LLaMa Guard model.
Formatted specifically for DPO fine-tuning, reducing preprocessing effort for that task.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count and dataset size are unknown, which may limit suitability assessment.
The specific methodology and criteria for the LLaMa Guard safety labeling are not detailed.

Provenance

Source: Anthropic/hh-rlhf (original source), formatted by javirandor.
Collection Method: Inherited and reformatted from an existing human preference dataset, with added safety labels.
Time Range: null
Freshness: Last updated 2025-03-28 11:56:04.
Geography: null

License information is unknown; terms of use for the derived dataset should be verified.

Text Parquet Size Categories1 Kn10 K Chat Conversations Safety Librarypolars Modalitytext Librarymlcroissant Librarydatasets Librarypandas Regionus Llm Training Dpo Human Feedback Licensemit

HH RLHF Safety V3 DPO: Human Preference Data for LLM Safety Tuning

Description

Use Cases

Strengths

Limitations

Provenance

Related Topics

Related Datasets

Quality Score

Community

Dataset Info

Community

Dataset Info