Anthropic's HH-RLHF dataset contains between 100,000 and 1,000,000 human preference comparisons focused on model helpfulness and harmlessness, released in 2022. These text-based records are designed to facilitate the training of reward models for Reinforcement Learning from Human Feedback (RLHF) rather than supervised fine-tuning.
Training dialogue agents directly on this data with supervised learning is likely to give sub-optimal results; the comparisons are intended for training preference (reward) models.
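To illustrate why comparison data suits reward modeling, here is a minimal sketch of the pairwise (Bradley-Terry style) loss commonly used for this purpose. It is not taken from the dataset's documentation. The field names `chosen` and `rejected`, the example record, and the `toy_reward` function are assumptions made for illustration; a real reward model would be a trained neural network.

```python
import math


def pairwise_preference_loss(chosen_score: float, rejected_score: float) -> float:
    """Return -log(sigmoid(chosen_score - rejected_score)).

    The loss is small when the preferred response scores higher than
    the rejected one, and grows as the ordering is violated.
    """
    margin = chosen_score - rejected_score
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def toy_reward(text: str) -> float:
    """Hypothetical stand-in for a learned reward model (scores by length)."""
    return 0.01 * len(text)


# Hypothetical record; the field names are an assumption about the layout.
record = {
    "chosen": "Human: How do I boil an egg?\n\nAssistant: Simmer it for about ten minutes.",
    "rejected": "Human: How do I boil an egg?\n\nAssistant: No idea.",
}

loss = pairwise_preference_loss(
    toy_reward(record["chosen"]),
    toy_reward(record["rejected"]),
)
print(f"pairwise loss: {loss:.4f}")
```

The loss depends only on the difference between the two scores, so each record supervises a relative ranking rather than a target text. This is one way to see why the same records are a poor fit for direct supervised training of a dialogue agent.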