Name: LLM Performance Evaluation on 20 Sarcopenia Patient Queries
Creator: Tao Huang
Published: 2026-03-18T07:45:23
License: CC-BY-4.0
Keywords: Gemini, Sarcopenia, Chat Gpt, Benchmark, Healthcare, Patient Queries, Clinical Evaluation, Tabular, Deepseek, Large Language Models

Description

Twenty standardized, patient-centered questions spanning six clinical domains were used to evaluate three large language models (Deepseek, ChatGPT, Gemini). Three clinician researchers graded the responses for accuracy and comprehensiveness; the reported results include mean word counts and performance ratings. The dataset was authored by Tao Huang and last updated in March 2026.

Use Cases

Benchmarking LLM accuracy on clinical queries based on the four-point grading scale
Comparing response length and detail across models based on reported mean word counts
Analyzing model performance across the clinical domains covered in the results, such as risk factors and diagnosis
Assessing response comprehensiveness using the five-point scale applied to 'Good' or higher-rated answers

Strengths

Data is derived from a structured evaluation by three clinician researchers, suggesting expert validation
Includes specific performance metrics like mean word counts (e.g., 583.75 ± 71.89 for Deepseek) and accuracy ratings
Covers six defined clinical domains for sarcopenia, providing a multi-faceted assessment
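
The word-count figures are reported in a "mean ± spread" form (e.g., 583.75 ± 71.89 for Deepseek). If per-response word counts can be recovered from the file, a summary in the same form could be recomputed along the lines of the sketch below. The function name is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption: the dataset description does not state which spread measure the authors used.

```python
import statistics


def summarize_word_counts(counts):
    """Format a list of per-response word counts as 'mean ± SD'.

    Assumption: the spread is the sample standard deviation (n - 1);
    the dataset description does not say which measure was reported.
    """
    mean = statistics.mean(counts)
    sd = statistics.stdev(counts)  # requires at least two values
    return f"{mean:.2f} ± {sd:.2f}"


# Illustrative values only, not taken from the dataset.
print(summarize_word_counts([500, 600, 700]))
```

If the recomputed spread does not match the published figure, switching to `statistics.pstdev` (population standard deviation) is the first thing to check.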

Limitations

Row count is unknown, which may limit suitability assessment
Column-level documentation is absent; field semantics must be inferred after download
The dataset is small (225.5 KB) and limited in scope: 20 questions evaluated across three models

Provenance

Source: Tao Huang via figshare
Collection Method: A panel of sarcopenia clinician researchers developed questions, input them into LLMs, and independently assessed anonymized, randomized responses.
Time Range: The study period is not specified, but the dataset was updated in March 2026.
Freshness: Last updated 2026-03-18 07:45:23; freshness should be verified
Geography: Spatial coverage is not specified.

Primary data is contained in a DOCX file, which may require conversion for programmatic analysis.
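
One way to do that conversion without extra dependencies is sketched below. A DOCX file is a ZIP archive whose main body lives in `word/document.xml`, so the Python standard library is enough to pull out table cells. The function names and the example filename are hypothetical, and the sketch assumes the results are stored as ordinary Word tables, which the dataset description does not confirm.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def extract_docx_tables(path_or_file):
    """Return each table in a DOCX file as a list of rows of cell strings.

    Accepts a filesystem path or a binary file-like object.
    Nested tables, if any, are returned as separate entries.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(f"{{{W_NS}}}tbl"):
        rows = []
        for tr in tbl.findall("w:tr", NS):
            cells = []
            for tc in tr.findall("w:tc", NS):
                # Join the text runs of each paragraph, then join paragraphs.
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
                    for p in tc.findall("w:p", NS)
                ]
                cells.append("\n".join(paragraphs).strip())
            rows.append(cells)
        tables.append(rows)
    return tables


def table_to_csv_text(rows):
    """Serialize one extracted table to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

With a downloaded file named, say, `sarcopenia_llm_evaluation.docx` (a placeholder, not the actual filename), `extract_docx_tables("sarcopenia_llm_evaluation.docx")` would return the tables in document order, and each can be passed to `table_to_csv_text` for export. Merged cells and text outside tables are not handled by this sketch.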
