Name: Benchmarking LLM Responses to IBD Patient Questions, 20 Questions and 5 Models
Creator: Xiaoyue Wang
Published: 2026-04-10T05:57:45
License: CC-BY-4.0
Keywords: Readability Assessment, Inflammatory Bowel Disease, Benchmark, Healthcare, Tabular, Medical Benchmark, Transparency Evaluation, Large Language Models

Description

A cross-sectional benchmark study conducted January 17–24, 2026, evaluated five publicly available large language models on 20 patient-facing inflammatory bowel disease (IBD) questions, producing 100 model–question responses. The dataset contains scores for informational quality (DISCERN, EQIP), transparency proxies (JAMA benchmark criteria), and readability (six readability indices). The work was authored by Xiaoyue Wang and is shared under a CC-BY-4.0 license.

Use Cases

Benchmarking the informational quality of LLM outputs for medical questions based on DISCERN and EQIP scores.
Evaluating transparency and disclosure in AI-generated health content based on JAMA benchmark criteria.
Assessing the readability of patient-facing AI responses based on six automated readability indices.
Comparing performance across different LLMs on a standardized set of clinical questions.

Strengths

Dataset is based on a structured benchmark using 20 guideline-derived questions across the IBD care pathway.
High interrater agreement reported for scoring, with ICC values ranging from 0.760 to 0.842 and weighted kappa up to 0.936.
All 10 measured outcomes showed statistically significant variation across models (Holm-adjusted P < 0.001).
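For readers unfamiliar with the Holm adjustment mentioned above, the following is a minimal sketch of the standard Holm step-down procedure. It is not taken from the study, which does not describe its analysis code here; the function name `holm_adjust` is hypothetical and the p-values in the usage note are made-up illustrative numbers, not values from the dataset.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order.

    Illustrative helper only; the study's own analysis code is not
    part of this dataset description.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by (m - k + 1),
        # capped at 1.0.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

As a usage example with invented inputs, `holm_adjust([0.01, 0.04, 0.03])` multiplies the smallest value by 3, the next by 2, and the largest by 1, then carries forward the running maximum so the adjusted values stay ordered.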

Limitations

Row count and column-level documentation are absent; field semantics must be inferred after download.
The file is small (50.1 KB), which suggests limited scope and likely summary-level data.
The file format is DOCX, which may require conversion for programmatic analysis.
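One way to handle the DOCX conversion is sketched below using only the Python standard library. A DOCX file is a ZIP archive whose main content lives in `word/document.xml`, with tables stored as `w:tbl` elements. The function name `extract_docx_tables` is hypothetical, and the sketch assumes the scores are stored as ordinary Word tables, which this description does not confirm.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def extract_docx_tables(path):
    """Return every table in a DOCX file as a list of rows of cell strings.

    Assumption: the data of interest is held in plain Word tables.
    Nested tables are returned as separate tables, and their text also
    appears inside the enclosing cell.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(f"{{{W_NS}}}tbl"):
        rows = []
        for tr in tbl.findall("w:tr", NS):
            cells = []
            for tc in tr.findall("w:tc", NS):
                # Concatenate all text runs inside the cell.
                text = "".join(t.text or "" for t in tc.iter(f"{{{W_NS}}}t"))
                cells.append(text.strip())
            rows.append(cells)
        tables.append(rows)
    return tables
```

Each returned table can then be written out with the standard `csv` module or loaded into a data frame. Column names and meanings still have to be checked by hand, since the dataset ships without column-level documentation.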

Provenance

Source: figshare
Collection Method: Queries were conducted via official LLM web interfaces under default settings, with responses evaluated by two blinded raters.
Time Range: Queries conducted from January 17–24, 2026.
Freshness: Last updated 2026-04-10 05:57:46; check the source for a newer version before use.
Geography: Not specified

