Name: UltraSafety: 3,000 Harmful Instructions with Jailbreak Prompts for LLM Safety
Creator: openbmb
Published: 2024-03-15T07:15:48
Keywords: Ai Safety, Llm Evaluation, Text, Harmful Instructions, Jailbreak Prompts

Description

A collection of 3,000 harmful instructions paired with jailbreak prompts, created for evaluating large language model safety. The dataset was constructed by deriving 1,000 seed instructions from AdvBench and MaliciousInstruct, bootstrapping 2,000 more via Self-Instruct, and manually screening 830 high-quality jailbreak prompts from AutoDAN. It was authored by openbmb and last updated on March 16, 2024.

Use Cases

Training safety classifiers based on the harmful instruction examples.
Evaluating LLM robustness against jailbreak attacks using the curated prompts.
Benchmarking the effectiveness of different safety alignment techniques.
Studying patterns in adversarial prompt generation for AI safety research.

Strengths

Contains 3,000 total harmful instruction and jailbreak prompt pairs.
Includes 830 manually screened, high-quality jailbreak prompts.
Derives from established safety benchmarks AdvBench and MaliciousInstruct.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
Data may reflect bias inherent to the source benchmarks and generation methods.

Provenance

Source: Derived from AdvBench and MaliciousInstruct; jailbreak prompts from AutoDAN.
Collection Method: Bootstrapped using Self-Instruct; manual screening of jailbreak prompts.
Freshness: Last updated 2024-03-16 13:25:54; freshness should be verified.

License is unknown; terms of use must be verified before application.

Text Ai Safety Llm Evaluation Harmful Instructions Jailbreak Prompts

UltraSafety: 3,000 Harmful Instructions with Jailbreak Prompts for LLM Safety

Description

Use Cases

Strengths

Limitations

Provenance

Related Topics

Related Datasets

Quality Score

Community

Dataset Info

Community

Dataset Info