Name: LLM Hallucination Detection Agreement Metrics with Confidence Intervals
Creator: Callum Hill
Published: 2026-04-03T17:27:42
License: CC-BY-4.0

Description

This dataset is a single 9.5 KB Excel file of statistical agreement metrics comparing large language model outputs with a human-adjudicated reference panel. Callum Hill published it in April 2026. For each of six inter-rater reliability and classification metrics, it reports a mean value with a 95% bootstrap confidence interval.

Use Cases

Compare Cohen's κ and Gwet's AC1 values to assess inter-rater reliability between LLM and human analysts.
Analyze Sensitivity and Specificity metrics to evaluate the classification performance of an LLM in detecting hallucinations.
Use the reported Jaccard index to measure the similarity between LLM-generated and human-coded thematic segments.
Benchmark overall Agreement (%) against a reference-standard human panel for model validation.
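
For readers comparing these metrics, the sketch below shows how all six can be derived from one 2x2 table of binary labels (hallucination present or absent), with the human panel treated as the reference. The function name, argument names, and example counts are hypothetical and are not taken from the dataset, and the study's actual computation may differ. The Gwet's AC1 chance term is the two-rater, two-category form.

```python
def agreement_metrics(tp, fp, fn, tn):
    """Agreement metrics from a 2x2 table (LLM vs. human reference).

    tp: both flag a hallucination; fp: only the LLM flags one;
    fn: only the human panel flags one; tn: neither flags one.
    Illustrative only: no guard against empty rows or columns.
    """
    n = tp + fp + fn + tn
    po = (tp + tn) / n                      # observed agreement
    llm_pos = (tp + fp) / n                 # LLM positive rate
    human_pos = (tp + fn) / n               # human positive rate

    # Cohen's kappa: chance agreement from each rater's own marginals
    pe_kappa = llm_pos * human_pos + (1 - llm_pos) * (1 - human_pos)
    # Gwet's AC1: chance agreement from the averaged positive rate
    pi = (llm_pos + human_pos) / 2
    pe_ac1 = 2 * pi * (1 - pi)

    return {
        "agreement_pct": 100 * po,
        "cohen_kappa": (po - pe_kappa) / (1 - pe_kappa),
        "gwet_ac1": (po - pe_ac1) / (1 - pe_ac1),
        "jaccard": tp / (tp + fp + fn),     # overlap of positive labels
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up counts, for illustration only
metrics = agreement_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Cohen's κ and Gwet's AC1 differ only in the chance-agreement term, which is why the two can diverge when one label is much rarer than the other.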

Strengths

All six key metrics (Agreement, Cohen's κ, Gwet's AC1, Jaccard index, Sensitivity, Specificity) are reported with 95% bootstrap confidence intervals (B=1000).
Dataset is openly licensed under CC-BY-4.0.
Data is structured in a single, small (9.5 KB) Excel file for straightforward access.
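
As a rough illustration of what a 95% bootstrap interval with B=1000 involves, the sketch below resamples paired labels with replacement and takes percentiles of the recomputed statistic. The dataset description does not state which bootstrap variant or resampling unit the study used, so the percentile method, the per-item resampling, the function names, and the example data are all assumptions.

```python
import random


def percent_agreement(pairs):
    """Share of items where the LLM label equals the human label."""
    return sum(1 for llm, human in pairs if llm == human) / len(pairs)


def bootstrap_ci(pairs, statistic, b=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (assumed variant, not confirmed by the source)."""
    rng = random.Random(seed)
    n = len(pairs)
    # Recompute the statistic on b resamples of n items drawn with replacement
    estimates = sorted(
        statistic([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(b)
    )
    lower = estimates[round((alpha / 2) * b)]
    upper = estimates[round((1 - alpha / 2) * b) - 1]
    return lower, upper


# Made-up (LLM label, human label) pairs, for illustration only
pairs = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 5 + [(0, 0)] * 45
low, high = bootstrap_ci(pairs, percent_agreement)
print(f"agreement = {percent_agreement(pairs):.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Any of the six metrics could be passed as `statistic`, provided it accepts the same list of label pairs.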

Limitations

The file is only 9.5 KB, which indicates that it contains summary statistics rather than raw evaluation data.
The sample size (number of coded segments or evaluations) underlying the metrics is not stated, which limits reproducibility.
No raw columns or sample records are provided, so the aggregated results cannot be recomputed or re-analyzed.

Provenance

Source: figshare, uploaded by author Callum Hill.
Collection Method: Derived from a study comparing LLM outputs to a reference-standard, human-adjudicated panel, using blinded human analysis and expert consensus adjudication.
Freshness: Last updated on April 3, 2026.

The data are in XLS (Excel) format. The platform tags 'Context Complicate Coding' and 'Define Llm Hallucination' suggest that the metrics come from a specific LLM evaluation task in qualitative analysis.
