Sign in to view source links and access this dataset
Description
A collection of 3,000 harmful instructions paired with jailbreak prompts, created for evaluating large language model safety. The dataset was constructed by deriving 1,000 seed instructions from AdvBench and MaliciousInstruct, bootstrapping 2,000 more via Self-Instruct, and manually screening 830 high-quality jailbreak prompts from AutoDAN. It was authored by openbmb and last updated on March 16, 2024.
Use Cases
Training safety classifiers based on the harmful instruction examples.
Evaluating LLM robustness against jailbreak attacks using the curated prompts.
Benchmarking the effectiveness of different safety alignment techniques.
Studying patterns in adversarial prompt generation for AI safety research.
Strengths
Contains 3,000 total harmful instruction and jailbreak prompt pairs.
Includes 830 manually screened, high-quality jailbreak prompts.
Derives from established safety benchmarks AdvBench and MaliciousInstruct.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
Data may reflect bias inherent to the source benchmarks and generation methods.
Provenance
Source
Derived from AdvBench and MaliciousInstruct; jailbreak prompts from AutoDAN.
Collection Method
Bootstrapped using Self-Instruct; manual screening of jailbreak prompts.
Freshness
Last updated 2024-03-16 13:25:54; freshness should be verified.
License is unknown; terms of use must be verified before application.