Name: Safe RLHF: Human Preference Data for Constrained AI Value Alignment
Creator: PKU-Alignment
Published: 2023-05-15T11:47:08
License: Apache-2.0
Keywords: Alpaca, Safety, RLHF, AI Safety, Safe Reinforcement Learning, Llama, GPT, Transformers, DeepSpeed, Safe Reinforcement Learning from Human Feedback, Large Language Model, Reinforcement Learning, Transformer, LLMs, Large Language Models, Reinforcement Learning from Human Feedback, Vicuna, Beaver, Safe RLHF

Description

PKU-Alignment developed this dataset to support constrained value alignment through Safe Reinforcement Learning from Human Feedback (Safe RLHF). It provides human-annotated preference data for large language models, targeting the trade-off between helpfulness and safety constraints.

Use Cases

Training safety reward models to score model responses based on safety preference labels
Fine-tuning LLMs using Safe RLHF to minimize harmful outputs while maintaining helpfulness
Evaluating the safety alignment of models like Llama and Vicuna against human-annotated benchmarks
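
The first use case above, training a reward model from preference labels, is commonly done with a pairwise (Bradley-Terry style) loss. The sketch below illustrates that general technique only; it is not taken from the PKU-Alignment code, and the function names and the example scores are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected):
    """Return -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the model scores the human-preferred response
    well above the rejected one, and large when the order is reversed.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() is never called with a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(score_pairs):
    """Average the loss over (preferred, rejected) score pairs."""
    losses = [pairwise_preference_loss(p, r) for p, r in score_pairs]
    return sum(losses) / len(losses)


# Hypothetical model scores for three annotated comparisons.
example_pairs = [(1.2, -0.3), (0.1, 0.4), (2.0, 2.0)]
print(mean_pairwise_loss(example_pairs))
```

In practice the scores would come from a neural reward model and the loss would be minimized with an autograd framework; plain floats are used here to keep the example dependency-free.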

Strengths

Released under the permissive Apache-2.0 license
Developed by PKU-Alignment, a research group focused on AI alignment
Targets the specific technical niche of constrained optimization in RLHF
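
To make "constrained optimization in RLHF" concrete: one generic formulation maximizes expected reward (helpfulness) subject to expected cost (harm) staying below a limit, and handles the constraint with a Lagrange multiplier. The sketch below shows that generic Lagrangian scheme under stated assumptions; the function names, the cost limit of 0.0, and the learning rate are illustrative, and the actual Safe RLHF implementation may differ.

```python
def penalized_objective(mean_reward, mean_cost, multiplier, cost_limit=0.0):
    """Policy objective: reward minus a penalty for exceeding the cost limit."""
    return mean_reward - multiplier * (mean_cost - cost_limit)


def update_multiplier(multiplier, mean_cost, cost_limit=0.0, learning_rate=0.1):
    """Dual ascent step on the multiplier, projected to stay non-negative.

    The multiplier grows while the policy violates the constraint
    (mean_cost > cost_limit) and shrinks toward zero once it is satisfied.
    """
    step = learning_rate * (mean_cost - cost_limit)
    return max(0.0, multiplier + step)


# Hypothetical batch statistics from three training iterations.
multiplier = 1.0
for mean_reward, mean_cost in [(0.8, 0.5), (0.7, 0.2), (0.6, -0.1)]:
    objective = penalized_objective(mean_reward, mean_cost, multiplier)
    multiplier = update_multiplier(multiplier, mean_cost)
    print(objective, multiplier)
```

The design point is that the safety weight is not a fixed hyperparameter: it is adjusted during training according to how far the policy is from satisfying the constraint.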

Limitations

The primary metadata does not state the sample size or row count
Safety definitions are subject to the specific cultural and ethical guidelines of the annotator pool
Potential for label noise typical of human-in-the-loop feedback datasets

Provenance

Source: PKU-Alignment
Collection Method: Human annotation and preference ranking
Freshness: Updated as of November 2024.

Users should refer to the PKU-Alignment GitHub repository for specific data loading scripts and implementation details related to the 'Beaver' model series.
