LLM Performance and Human-AI Collaboration in 15 Cardiac Surgery Scenarios
by Marc Leon · Updated 7d ago
10.7 KB · 1 file
Available on 1 platform
Description
A 2026 study by Marc Leon presents a two-phase evaluation of five large language models (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) on 15 expert-curated cardiac surgery scenarios. The dataset contains normalized performance scores across 10 weighted evaluation dimensions, including scenario comprehension, patient safety, and hallucination avoidance. It also documents rating shifts from a blinded evaluation by senior surgeons, revealing patterns of human-AI collaboration.
Use Cases
Benchmarking LLM performance on complex, open-ended clinical reasoning tasks based on the 15 cardiac surgery scenarios.
Analyzing human-AI collaboration patterns using the documented rating shifts between the two evaluation phases.
Comparing model strengths and weaknesses across specific evaluation dimensions such as patient safety and hallucination avoidance.
Studying the phenomenon of 'overacceptance' where clinicians may over-trust incorrect AI-generated reasoning.
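For the benchmarking and comparison use cases above, a single per-model figure can be obtained by combining the per-dimension scores with their weights. The following is a minimal sketch, assuming the scores are normalized and the aggregate is a plain weighted mean; the listing does not state the study's actual weights or aggregation rule, so the dimension keys, weights, and scores below are hypothetical placeholders and not values from the dataset.

```python
def weighted_score(scores, weights):
    """Weighted mean of per-dimension scores.

    Both arguments are dicts keyed by dimension name. Assumption: the
    aggregate is a weighted mean; the study may use a different rule.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[d] * weights[d] for d in weights) / total_weight


# Hypothetical placeholder weights and scores, NOT taken from the dataset.
weights = {
    "scenario_comprehension": 2.0,
    "patient_safety": 3.0,
    "hallucination_avoidance": 1.0,
}
model_a = {
    "scenario_comprehension": 0.8,
    "patient_safety": 0.6,
    "hallucination_avoidance": 1.0,
}

result = weighted_score(model_a, weights)
print(f"weighted score: {result:.3f}")
```

Dividing by the sum of the weights keeps the result on the same scale as the inputs even when the weights do not sum to 1.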
Strengths
Evaluation is based on 15 high-fidelity cardiac surgery scenarios developed independently by senior cardiac surgeons.
Includes performance scores for five specific LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) across 10 weighted dimensions.
Documents a two-phase blinded evaluation process in which 7.57% of ratings were revised from affirmative to negative in the second round.
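A revision rate like the one quoted above can be computed from paired first-round and second-round ratings. The sketch below assumes each rating is a boolean (affirmative or negative) and that the denominator is the full set of paired ratings; the listing does not confirm either point, and the function name and sample data are hypothetical.

```python
def affirmative_to_negative_rate(first_round, second_round):
    """Share of paired ratings that moved from affirmative to negative.

    Assumption: ratings are booleans (True = affirmative) and the rate is
    taken over all paired ratings, not only the initially affirmative ones.
    """
    if len(first_round) != len(second_round):
        raise ValueError("both rounds must contain the same number of ratings")
    if not first_round:
        raise ValueError("at least one paired rating is required")
    flipped = sum(
        1 for before, after in zip(first_round, second_round)
        if before and not after
    )
    return flipped / len(first_round)


# Hypothetical illustration data, NOT taken from the dataset.
first_round = [True, True, False, True]
second_round = [True, False, False, False]

rate = affirmative_to_negative_rate(first_round, second_round)
print(f"revised from affirmative to negative: {rate:.2%}")
```

If the study instead reports the rate relative to initially affirmative ratings only, the denominator would be the count of `True` values in the first round.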
Limitations
The dataset is very small (10.7 KB), and its row count and column-level structure are not stated.
Column-level documentation is absent; field semantics must be inferred after download.
Data reflects a specific, controlled evaluation study; its generalizability to other clinical contexts may be limited.
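Because the structure is undocumented, a first step after download is to list the columns and count the rows before interpreting anything. This is a minimal sketch, assuming the single file is a CSV with a header row, which the listing does not confirm; the helper name and the in-memory sample (including its column names) are hypothetical stand-ins for the real file.

```python
import csv
import io


def _is_float(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize_csv(handle):
    """Return (row_count, {column: "numeric" | "text"}) for a CSV with a header."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    columns = reader.fieldnames or []
    summary = {}
    for col in columns:
        # Ignore empty or missing cells when guessing the column type.
        values = [r[col] for r in rows if r[col] not in (None, "")]
        numeric = bool(values) and all(_is_float(v) for v in values)
        summary[col] = "numeric" if numeric else "text"
    return len(rows), summary


# Hypothetical stand-in for the downloaded file; real column names are unknown.
sample = io.StringIO(
    "model,dimension,score\n"
    "model_a,patient_safety,0.6\n"
    "model_b,patient_safety,0.9\n"
)

row_count, column_types = summarize_csv(sample)
print(row_count, column_types)
```

For the real file, replace the `io.StringIO` sample with `open(path, newline="")`, where `path` points at the downloaded file.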
Provenance
Source
figshare, authored by Marc Leon.
Collection Method
The data were generated by a blinded, two-phase evaluation framework in which senior surgeons rated LLM responses to 15 expert-curated scenarios.