HH-RLHF: Helpful and Harmless Reinforcement Learning from Human Feedback

Name: HH-RLHF: Helpful and Harmless Reinforcement Learning from Human Feedback
Creator: Anthropic
Published: 2022-12-08T20:11:33
Keywords: library:polars, library:dask, modality:text, size_categories:100K<n<1M, library:mlcroissant, library:datasets, region:us, JSON, Human Feedback, license:mit

Description

Anthropic's HH-RLHF dataset, released in 2022, contains between 100,000 and 1,000,000 human preference comparisons covering model helpfulness and harmlessness. The text records are intended for training reward models for Reinforcement Learning from Human Feedback (RLHF), not for supervised fine-tuning.

Use Cases

Training reward models by comparing 'chosen' vs 'rejected' response pairs to score model outputs
Fine-tuning language models for safety using the specific harmlessness preference labels
Benchmarking alignment algorithms against the data used in the research paper arXiv:2204.05862
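
As a rough illustration of the first use case, a reward model is commonly fit with a pairwise ranking objective: it should assign the 'chosen' response a higher scalar score than the 'rejected' one. The sketch below uses the standard logistic (Bradley-Terry style) form of that objective. The function names are hypothetical, the scores are assumed to come from some reward model that is not shown here, and this is not presented as the exact procedure used in the paper.

```python
import math


def pairwise_preference_loss(chosen_score: float, rejected_score: float) -> float:
    """Negative log-likelihood that the chosen response outranks the rejected one.

    Equals -log(sigmoid(chosen_score - rejected_score)), computed as
    log(1 + exp(-margin)) in a form that avoids overflow for large margins.
    """
    margin = chosen_score - rejected_score
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, factor out exp(-margin) to keep exp() bounded.
    return -margin + math.log1p(math.exp(margin))


def preference_accuracy(score_pairs) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly."""
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("score_pairs must not be empty")
    return sum(chosen > rejected for chosen, rejected in pairs) / len(pairs)
```

The loss shrinks toward zero as the chosen score pulls ahead of the rejected score and grows roughly linearly when the ranking is inverted, so minimizing it over many pairs pushes the model toward the annotators' preferences.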

Strengths

Contains over 100,000 human-annotated preference pairs
Covers both the helpfulness and harmlessness dimensions through preference comparisons
Distributed under the permissive MIT license

Limitations

Explicitly not recommended for supervised fine-tuning of dialogue agents as it may lead to poor performance
Subject to the subjective biases of human annotators regarding what constitutes 'helpful' or 'harmless' behavior

Provenance

Source: Anthropic (arxiv:2204.05862)
Collection Method: Human annotation of model-generated responses.
Freshness: Last updated May 2023; based on research published in April 2022.

Users should be aware that training dialogue agents directly on this data via supervised learning is likely to lead to sub-optimal results; the dataset is specifically intended for preference/reward model training.
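
A minimal sketch of preparing records for preference training follows. It rests on assumptions that this listing does not confirm: that each record is a JSON object, one per line, with string fields "chosen" and "rejected", and that both fields hold a full transcript whose assistant turns are marked with "\n\nAssistant:". The helper names parse_records and split_comparison are hypothetical.

```python
import json
import os

# Assumed turn marker; not stated in this listing.
ASSISTANT_MARKER = "\n\nAssistant:"


def parse_records(jsonl_text: str) -> list:
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def split_comparison(record: dict) -> dict:
    """Split one comparison into the shared prompt and the two final responses."""
    chosen, rejected = record["chosen"], record["rejected"]
    # Longest character prefix shared by both transcripts.
    shared = os.path.commonprefix([chosen, rejected])
    # Cut just after the last assistant marker that lies inside the shared part.
    cut = shared.rfind(ASSISTANT_MARKER)
    if cut == -1:
        raise ValueError("no shared assistant turn marker found")
    cut += len(ASSISTANT_MARKER)
    return {
        "prompt": chosen[:cut],
        "chosen": chosen[cut:].strip(),
        "rejected": rejected[cut:].strip(),
    }
```

Keeping the shared prompt separate from the two endings makes it straightforward to score each response in the same context, which is what a pairwise reward objective needs. Feeding only the 'chosen' text to a supervised trainer is the usage the note above cautions against.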

Quality Score

Overall: D39
Description: 42
Source: 41
Reputation: 41
Access: 22

Community

31.9K downloads
1.7K likes
0 views

Dataset Info

Author: Anthropic
Created: Dec 8, 2022
Updated: May 26, 2023
Last synced: Jul 26, 2026