Name: Kruskal–Wallis Tests for NLAT and DISCERN AI Chatbot Evaluations
Creator: Meisam Dastani
Published: 2026-05-11T17:38:25
License: CC-BY-4.0
Keywords: Benchmark, Healthcare, Tabular, AI Chatbots, Clinical Assessment, Multiple Sclerosis, Large Language Models, Excel, Medical Evaluation, Synthetic

Description

This dataset, published by Meisam Dastani on figshare, contains Kruskal–Wallis test results comparing four large language models (ChatGPT, Gemini, Copilot, and Grok) on medical questions about multiple sclerosis. Responses were scored with the DISCERN-AI and NLAT-AI assessment tools. The data was last updated on 2026-05-11.

Use Cases

Compare the statistical performance of AI chatbots based on DISCERN-AI and NLAT-AI scores.
Benchmark LLM responses for accuracy and transparency in the context of multiple sclerosis.
Assess model strengths and weaknesses across medical domains like diagnosis, treatment, and disease management.
Inform the selection of AI models for generating patient-facing medical content.
Validate assessment tools like DISCERN-AI and NLAT-AI for evaluating AI-generated medical text.

Strengths

Evaluation is based on 25 specific medical questions across five key domains.
Compares four prominent, publicly accessible LLMs: ChatGPT, Gemini, Copilot, and Grok.
Uses established assessment tools, DISCERN-AI and NLAT-AI, for structured evaluation.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not documented, so suitability is hard to judge before download.
The file is very small (5.5 KB), which suggests it holds summary test results rather than the underlying responses or per-question scores.

Provenance

Source: Meisam Dastani via figshare.
Collection Method: Responses from four LLMs to 25 medical questions were evaluated using DISCERN-AI and NLAT-AI tools.
Freshness: Last updated 2026-05-11 17:38:25; check the figshare record for a newer version before use.
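
For readers unfamiliar with the test named in the title, the following is a minimal sketch of how a Kruskal–Wallis comparison across four chatbots could be computed, using only the Python standard library. It is not the author's analysis code. The function names `kruskal_wallis` and `chi2_sf` and every score in `hypothetical_scores` are invented for illustration and do not come from the dataset.

```python
import math
from itertools import chain


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    y = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y) = erfc(sqrt(y))
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q


def kruskal_wallis(groups):
    """Return (H, p) for a list of score lists, with tie correction."""
    pooled = sorted(chain.from_iterable(groups))
    n_total = len(pooled)
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        t = j - i                               # size of this tie group
        rank_of[pooled[i]] = (i + 1 + j) / 2    # average of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    rank_sum_term = sum(
        sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups
    )
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    h = max(h / correction, 0.0)                # guard against rounding below 0
    return h, chi2_sf(h, len(groups) - 1)


# Hypothetical scores, invented for illustration only (not from the dataset).
hypothetical_scores = {
    "ChatGPT": [4, 5, 4, 3, 5],
    "Gemini": [3, 4, 4, 3, 4],
    "Copilot": [2, 3, 3, 2, 4],
    "Grok": [3, 3, 4, 2, 3],
}
h_stat, p_value = kruskal_wallis(list(hypothetical_scores.values()))
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

The p-value here uses the chi-square approximation with (number of groups − 1) degrees of freedom, which is the usual large-sample approach. With four models that gives 3 degrees of freedom.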

Data is provided in XLS format, requiring software like Microsoft Excel or a compatible spreadsheet tool to open.
