LLM Performance and Human-AI Collaboration in 15 Cardiac Surgery Scenarios
by Marc Leon · Updated 7d ago
10.6 KB · 1 file
Available on 1 platform
Description
A 2026 study by Marc Leon evaluates five large language models (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) on 15 high-fidelity cardiac surgery reasoning tasks. The dataset contains normalized performance scores across 10 evaluation dimensions, including scenario comprehension, patient safety, and hallucination avoidance, from a blinded two-phase evaluation by senior surgeons. It also records rating shifts between the two evaluation rounds: 7.57% of ratings were revised from affirmative to negative in the second round.
Use Cases
Benchmarking LLM performance on complex clinical reasoning tasks using the 15 cardiac surgery scenarios.
Analyzing human-AI collaboration patterns using the recorded rating shifts between the two blinded evaluation phases.
Comparing model strengths and weaknesses across evaluation dimensions such as patient safety and hallucination avoidance.
Studying the phenomenon of 'overacceptance', in which clinicians may incorrectly accept flawed AI reasoning, as described in the study's conclusions.
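The rating-shift analysis mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the study's method: the label strings, the function name revision_rate, and the choice of denominator (all paired ratings, rather than only first-round affirmative ratings) are assumptions, since the listing does not document how the 7.57% figure was computed.

```python
def revision_rate(round1, round2, from_label="affirmative", to_label="negative"):
    """Fraction of all paired ratings that moved from from_label to to_label.

    Assumption: the denominator is every paired rating. The dataset listing
    does not state which denominator the study used.
    """
    if len(round1) != len(round2):
        raise ValueError("both rounds must contain the same number of ratings")
    if not round1:
        raise ValueError("at least one paired rating is required")
    flipped = sum(
        1 for first, second in zip(round1, round2)
        if first == from_label and second == to_label
    )
    return flipped / len(round1)


# Hypothetical ratings for illustration only; these are not values from the dataset.
first_round = ["affirmative"] * 8
second_round = ["affirmative"] * 7 + ["negative"]
rate = revision_rate(first_round, second_round)
```

If the study instead divided by the number of first-round affirmative ratings, only the denominator in the return statement would change.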
Strengths
Includes performance scores for five specific LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) on 15 expert-developed scenarios.
Provides a 10-dimensional weighted evaluation framework with specific median scores, such as scenario comprehension (0.920) and patient safety (0.507).
Captures human evaluator judgment shifts, with 7.57% of ratings revised from affirmative to negative in the second round.
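To make the idea of a weighted evaluation framework concrete, the sketch below combines per-dimension scores into one weighted mean. The two scores are the medians quoted above; the weights, the dictionary keys, and the function name weighted_score are hypothetical, because the listing does not publish the actual weights or the aggregation formula. The result is an illustration of the arithmetic, not a figure from the study.

```python
def weighted_score(scores, weights):
    """Weighted mean of normalized per-dimension scores.

    Assumption: dimensions are aggregated as a weighted arithmetic mean.
    The study's real weights and formula are not given in the listing.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


# Median scores quoted in the listing for two of the ten dimensions.
median_scores = {"scenario_comprehension": 0.920, "patient_safety": 0.507}
# Hypothetical weights chosen only to show the calculation.
example_weights = {"scenario_comprehension": 1.0, "patient_safety": 2.0}
combined = weighted_score(median_scores, example_weights)
```

Weighting patient safety more heavily pulls the combined value toward the lower of the two medians, which is the kind of effect a weighted framework is meant to capture.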
Limitations
Row count and column-level documentation are unknown, which makes it hard to assess suitability before opening the file.
The dataset is small (10.6 KB), which suggests it contains summary results rather than raw evaluation data.
Data is specific to cardiac surgery scenarios; generalizability to other medical domains is not assessed.
Provenance
Source
figshare, author Marc Leon.
Collection Method
A panel of senior cardiac surgeons developed 15 scenarios; five LLMs were evaluated using a multi-agent prompting strategy, with ratings from a separate group of senior surgeons in a blinded two-phase process.
Time Range
Study published in 2026.
Freshness
Last updated 2026-05-29 06:18:53.
The license is CC-BY-4.0, which requires attribution. The file format is XLSX, which requires spreadsheet software or a library that can read XLSX files.
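Because the row count and column layout are undocumented, a first step after downloading is to see which worksheets the file contains. An XLSX file is a ZIP archive, so the Python standard library is enough for this. The sketch below is a minimal example under stated assumptions: the function name list_sheet_names is illustrative, and it assumes the workbook uses the common transitional OOXML namespace shown in the code.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by xl/workbook.xml in transitional OOXML spreadsheets.
# Assumption: the dataset file was saved in this common variant.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def list_sheet_names(xlsx):
    """Return worksheet names from an XLSX file (a path or a binary file object)."""
    with zipfile.ZipFile(xlsx) as archive:
        workbook_xml = archive.read("xl/workbook.xml")
    root = ET.fromstring(workbook_xml)
    # Each <sheet> element under <sheets> carries the worksheet name as an attribute.
    sheets = root.findall(f"{{{MAIN_NS}}}sheets/{{{MAIN_NS}}}sheet")
    return [sheet.get("name") for sheet in sheets]
```

Calling list_sheet_names with the path of the downloaded file returns the sheet names in workbook order. Reading the cell contents is easier with a dedicated spreadsheet library.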