Name: Data Sheet 1_Deficiencies in clinical reasoning of LLMs in low back pain management and re
Creator: Jia-Hui Luo
Published: 2026-05-25T06:05:53
License: CC-BY-4.0
Keywords: ZIP, Prompt Engineering, Benchmark, Healthcare, Text, Clinical Reasoning, Finance, Large Language Models, Medical Evaluation, Low Back Pain

Description

A research dataset by Jia-Hui Luo, last updated in May 2026, for evaluating systematic error patterns of large language models in clinical reasoning about low back pain. It contains 103 multiple-choice questions and 30 clinical scenario questions, used to evaluate five LLMs across six dimensions: accuracy, completeness, practicality, readability, safety, and output stability. It also includes results from targeted prompt engineering interventions designed to remediate high-risk errors.

Use Cases

Benchmarking LLM clinical reasoning performance based on accuracy, completeness, practicality, readability, safety, and output stability scores.
Analyzing error patterns and safety-critical failures in medical LLM outputs based on qualitative content analysis.
Evaluating the efficacy of targeted prompt engineering for improving LLM safety and completeness in clinical guidance.
Training or fine-tuning models for medical QA tasks using the curated low back pain question bank.
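
For the benchmarking use case, per-model accuracy on the multiple-choice questions can be sketched as below. This is a minimal illustration only: the question IDs, model names, and answer letters are invented placeholders, and the dict layout is an assumption, since the card does not document the file schema.

```python
def accuracy_by_model(answer_key, responses):
    """Return the fraction of correctly answered questions per model.

    answer_key: dict mapping question ID -> correct option (hypothetical layout)
    responses:  dict mapping model name -> dict of question ID -> chosen option
    A question the model did not answer counts as incorrect.
    """
    if not answer_key:
        raise ValueError("answer_key must contain at least one question")
    scores = {}
    for model, answers in responses.items():
        correct = sum(
            1 for qid, key in answer_key.items() if answers.get(qid) == key
        )
        scores[model] = correct / len(answer_key)
    return scores


# Placeholder data, not taken from the dataset.
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
responses = {
    "model_x": {"q1": "A", "q2": "C", "q3": "B", "q4": "A"},
    "model_y": {"q1": "A", "q2": "B", "q3": "B"},  # q4 left unanswered
}
scores = accuracy_by_model(answer_key, responses)
```

Counting unanswered questions as incorrect is a design choice made here for simplicity; the study's own scoring rules may differ.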

Strengths

Uses a systematic three-phase evaluation methodology that includes qualitative error analysis, with inter-rater agreement reported as Cohen's κ = 0.84.
Evaluates five mainstream LLMs (GPT-5, GPT-4o, GPT-o3, Deepseek-V2.5, Grok-4) across six distinct clinical dimensions.
Reports quantitative results, such as accuracy rates exceeding 90% on general knowledge tests and statistically significant improvements from prompt engineering (p < 0.001 for Deepseek-V2.5).
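
Cohen's κ, cited in the Strengths above, measures agreement between two raters after correcting for agreement expected by chance. The following is a minimal sketch of the standard formula, assuming each rater assigns one categorical label per response. The label names and values are invented for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical error labels from two reviewers.
labels_a = ["omission", "unsafe", "omission", "correct", "correct", "unsafe"]
labels_b = ["omission", "unsafe", "correct", "correct", "correct", "unsafe"]
kappa = cohens_kappa(labels_a, labels_b)
```

In this toy example the raters agree on 5 of 6 items (observed agreement 5/6) and chance agreement is 1/3, so κ = (5/6 − 1/3) / (1 − 1/3) = 0.75.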

Limitations

Row count is unknown, which may limit suitability assessment.
Column-level documentation is absent; field semantics must be inferred after download.
The dataset is relatively small at 2.7 MB, suggesting limited raw data volume.

Provenance

Source: figshare
Collection Method: Derived from a low back pain examination question bank and clinical guidelines, with responses generated and evaluated by researchers.
Freshness: Last updated 2026-05-25 06:05:53; check the figshare record for newer versions before use.

License is CC-BY-4.0. Data is packaged in a ZIP file; contents require extraction.
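
Because the ZIP contents and field semantics are undocumented, a reasonable first step after download is to list the archive members before extracting anything. The sketch below uses only Python's standard `zipfile` module; the filename `data_sheet_1.zip` is a hypothetical local name, not the actual download name.

```python
import zipfile
from pathlib import Path


def list_zip_members(source):
    """Return (name, uncompressed_size_in_bytes) for each file in a ZIP archive.

    source may be a filesystem path or a binary file-like object.
    Directory entries are skipped.
    """
    with zipfile.ZipFile(source) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()
        ]


# Hypothetical local filename; adjust to wherever the archive was saved.
archive_path = Path("data_sheet_1.zip")
if archive_path.exists():
    for name, size in list_zip_members(archive_path):
        print(f"{size:>10}  {name}")
```

Inspecting member names and sizes first makes it easier to decide which files to open and which parser (CSV, spreadsheet, document) each one needs.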
