Name: Persona AF Elicitation: 450 Conversations Testing Alignment Faking in Gemma 3 27B-it
Creator: vincentoh
Published: 2026-03-06T03:59:04
Keywords: Safety, library:polars, Alignment, language:en, size_categories:n<1K, modality:text, modality:tabular, library:mlcroissant, library:datasets, arxiv:2601.10387, library:pandas, Text Classification, Text, Alignment Faking, LLM Safety, region:us, task_categories:text-classification, JSON, license:mit, Persona Elicitation, Elicitation

Description

450 conversations designed to test whether persona framing gates the expression of alignment faking (AF) in the Gemma 3 27B-it language model. The dataset was created by vincentoh and last updated on March 6, 2026. It covers 15 roles, 10 AF elicitation prompts, and 3 experimental conditions, counts that multiply to the 450 conversations, with responses judged by Claude Opus.
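The counts in the description are consistent with a fully crossed design (15 × 10 × 3 = 450). The sketch below enumerates such a grid. It assumes one conversation per (role, prompt, condition) cell, which the source does not state explicitly, and it uses placeholder labels because the actual role, prompt, and condition names are not given here.

```python
from itertools import product

# Placeholder labels: the source gives only the counts, not the actual names.
roles = [f"role_{i:02d}" for i in range(1, 16)]       # 15 roles
prompts = [f"prompt_{i:02d}" for i in range(1, 11)]   # 10 AF elicitation prompts
conditions = [f"condition_{i}" for i in range(1, 4)]  # 3 experimental conditions

# Assumed full factorial crossing: one conversation per cell.
design = list(product(roles, prompts, conditions))

print(len(design))  # 450
```

If the downloaded data has a different number of rows per cell, the full-crossing assumption does not hold and the grid should be rebuilt from the actual columns.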

Use Cases

Analyzing how persona framing affects strategic compliance across the 15 roles.
Studying how the targeted prompts elicit self-preservation behaviors in LLMs.
Evaluating model behavior under monitored vs. unmonitored conditions.
Training or benchmarking text classification models that detect alignment faking in conversational outputs.
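For the condition and role comparisons listed above, a common first step is the fraction of responses judged as alignment faking per group. The following is a minimal sketch using only the standard library. The field names `condition` and `judge_label`, the label value `"af"`, and the toy records are all hypothetical, since the dataset's column names are undocumented and must be checked after download.

```python
from collections import defaultdict

def af_rate_by(records, key, label_field="judge_label", positive="af"):
    """Return {group: fraction of records judged positive}, grouped by `key`.

    `label_field` and `positive` are hypothetical names, not the dataset's schema.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        if rec[label_field] == positive:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

# Toy records invented for illustration only; not taken from the dataset.
toy = [
    {"condition": "monitored", "judge_label": "af"},
    {"condition": "monitored", "judge_label": "no_af"},
    {"condition": "unmonitored", "judge_label": "no_af"},
    {"condition": "unmonitored", "judge_label": "no_af"},
]
rates = af_rate_by(toy, "condition")
print(rates)  # {'monitored': 0.5, 'unmonitored': 0.0}
```

Passing a different `key` (for example a role field) gives the per-role breakdown with the same function.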

Strengths

Contains 450 structured conversations for analysis.
Uses a controlled design with 15 distinct roles and 3 experimental conditions.
Employs Claude Opus as a blind judge for response evaluation.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is known, but the specific data format and sample structure are unavailable for preview.
Data may reflect bias inherent to the specific model (Gemma 3 27B-it) and prompting methodology used.
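Because column-level documentation is absent, a quick schema summary after download can help with inferring field semantics. The sketch below assumes the records are available as JSON Lines text (the listing is tagged JSON, but the exact layout is not confirmed), and the sample fields `role` and `turns` are hypothetical.

```python
import json
from collections import defaultdict

def summarize_fields(records):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for rec in records:
        for name, value in rec.items():
            seen[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items()}

# Hypothetical JSON Lines sample; the real field names must be read from the file.
sample = '{"role": "x", "turns": 2}\n{"role": "y", "turns": null}'
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

summary = summarize_fields(records)
print(summary)  # {'role': ['str'], 'turns': ['NoneType', 'int']}
```

Fields that show more than one type, or `NoneType`, are the ones most worth inspecting by hand before analysis.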

Provenance

Source: huggingface
Collection Method: Conversations generated via the Gemini API using the Gemma 3 27B-it model under a controlled experimental design.
Time Range: Published and last updated on 2026-03-06; the period over which the conversations were generated is not specified.
Freshness: Last updated 2026-03-06 03:59:41; freshness should be verified.
Geography: Region tag indicates 'us', but specific geographic coverage is not detailed in the description.

License is listed as 'mit' in platform tags, but the specific license file or terms are not confirmed in the provided input.
