Name: Performance of 5 LLMs vs. Junior Physicians in Emergency Internal Medicine Diagnosis
Creator: Jintao Wei
Published: 2026-05-08T05:56:41
License: CC-BY-4.0
Keywords: Emergency Medicine, Medical LLM, Benchmark, Healthcare, Tabular, Clinical Decision Support, Diagnostic Accuracy, Synthetic

Description

This dataset covers 154 anonymized emergency internal medicine patient cases from a single hospital, collected between January and May 2025 and used to evaluate the diagnostic performance of 5 large language models against 15 junior emergency department physicians. The study, authored by Jintao Wei and shared under a CC-BY-4.0 license, reports that DeepSeek-V3, for example, reached 90.0% main diagnostic accuracy, outperforming the physicians. The results were published on figshare in May 2026.

Use Cases

Benchmarking LLM diagnostic accuracy against human clinicians based on real-world emergency cases.
Analyzing differential diagnosis comprehensiveness scores for different AI models and medical specialties.
Comparing AI and human response times for clinical decision-making in time-sensitive settings.
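
When comparing accuracy figures drawn from a sample of this size, it can help to attach a confidence interval to each proportion before reading much into the gap between two of them. The following is a minimal sketch of a Wilson score interval, not a method taken from the study. The function name and the counts in the usage comment are hypothetical placeholders, because the dataset card does not state the numerator and denominator behind each reported percentage.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Return the Wilson score interval (low, high) for a proportion.

    z=1.96 corresponds to an approximate 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")

    p = successes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical usage with placeholder counts (NOT values from the dataset):
# low, high = wilson_interval(45, 50)
```

If the intervals for a model and for the physician group overlap heavily, the difference in point estimates should be interpreted with caution.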

Strengths

Includes performance metrics for 5 specific LLMs (ChatGPT-4o, Gemini-2.0, Grok3, DeepSeek-V3, Doubao) and 15 junior physicians.
Reports specific accuracy percentages (e.g., 90.0% for DeepSeek-V3) and response times (e.g., 360.2 seconds for physicians).
Uses a dataset of 154 real-world, anonymized patient cases from a defined time period (January to May 2025).

Limitations

Row count and column-level documentation are absent; field semantics must be inferred after download.
Data is from a single-center retrospective study, which may limit generalizability.
The file is very small (16.9 KB), which suggests it contains summary results rather than raw case-level data.

Provenance

Source: figshare, author Jintao Wei. The data likely originates from the Second Affiliated Hospital of Zhejiang University School of Medicine.
Collection Method: Single-center retrospective analysis of anonymized patient cases.
Time Range: Patient cases from January to May 2025.
Freshness: Last updated 2026-05-08 05:56:41; freshness should be verified.
Geography: Likely China, based on the hospital name.

The primary data file is a DOCX document, which may contain formatted text and tables rather than a machine-readable data table.
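
Because the data ships as a DOCX file, any tables in it need to be extracted before analysis. The sketch below uses only the Python standard library and relies on the fact that a DOCX file is a ZIP archive whose main content lives in `word/document.xml`. The function name and the file name in the usage comment are hypothetical, and the layout of the actual document is unknown, so treat this as a starting point rather than a loader tailored to this dataset.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a DOCX file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_tables(docx_file):
    """Return every table in a DOCX file as a list of rows of cell strings.

    `docx_file` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(W_NS + "tbl"):
        rows = []
        for tr in tbl.findall(W_NS + "tr"):
            cells = []
            for tc in tr.findall(W_NS + "tc"):
                # A cell holds one or more paragraphs; join the text runs of each.
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(W_NS + "t"))
                    for p in tc.findall(W_NS + "p")
                ]
                cells.append("\n".join(paragraphs).strip())
            rows.append(cells)
        tables.append(rows)
    return tables


# Hypothetical usage (the file name is a placeholder):
# for table in extract_docx_tables("downloaded_dataset.docx"):
#     for row in table:
#         print(row)
```

This simplified reader ignores merged cells and rows wrapped in content controls, so the extracted rows should be checked against the original document before use.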
