Name: LLM Performance and Human-AI Collaboration in 15 Cardiac Surgery Scenarios
Creator: Marc Leon
Published: 2026-05-29T06:18:53
License: CC-BY-4.0
Keywords: Human-AI Collaboration, Benchmark, LLM Evaluation, Healthcare, Tabular, Audio, Medical AI, Cardiac Surgery, Excel, Synthetic

Description

A 2026 study by Marc Leon evaluates five large language models (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) on 15 high-fidelity cardiac surgery reasoning tasks. The dataset contains normalized performance scores across 10 evaluation dimensions, including scenario comprehension, patient safety, and hallucination avoidance, assigned by senior surgeons in a blinded two-phase evaluation. It also records how ratings shifted between the two rounds: 7.57% of ratings were revised from affirmative to negative.

Use Cases

Benchmarking LLM performance on complex clinical reasoning using the 15 cardiac surgery scenarios.
Analyzing human-AI collaboration patterns through the rating shifts recorded between the two blinded evaluation phases.
Comparing model strengths and weaknesses across evaluation dimensions such as patient safety and hallucination avoidance.
Studying 'overacceptance', where clinicians incorrectly accept flawed AI reasoning, as described in the study's conclusions.
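
The rating-shift analysis can be illustrated with a minimal Python sketch. It assumes each rating is available as a paired phase-1 and phase-2 label and that the revision rate is taken over all paired ratings; the labels "yes" and "no", the function names, and the toy values are hypothetical and do not come from the dataset, whose column layout is undocumented.

```python
from collections import Counter


def rating_shifts(phase1, phase2):
    """Count transitions between paired phase-1 and phase-2 ratings."""
    if len(phase1) != len(phase2):
        raise ValueError("phase1 and phase2 must be paired lists of equal length")
    return Counter(zip(phase1, phase2))


def revision_rate(phase1, phase2, before="yes", after="no"):
    """Share of ALL paired ratings that moved from `before` to `after`.

    Assumption: the denominator is every paired rating, not only the
    ratings that started as `before`.
    """
    shifts = rating_shifts(phase1, phase2)
    total = sum(shifts.values())
    if total == 0:
        raise ValueError("no ratings supplied")
    return shifts[(before, after)] / total


# Hypothetical toy ratings, NOT values from the dataset.
phase1 = ["yes", "yes", "no", "yes", "yes", "no", "yes", "yes"]
phase2 = ["yes", "no", "no", "yes", "yes", "no", "yes", "yes"]

rate = revision_rate(phase1, phase2)
print(f"affirmative-to-negative revisions: {rate:.2%}")  # prints 12.50%
```

Counting every transition, rather than only the affirmative-to-negative one, also makes the reverse direction available for comparison.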

Strengths

Includes performance scores for five specific LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) on 15 expert-developed scenarios.
Reports median scores under a weighted 10-dimension evaluation framework, for example scenario comprehension (0.920) and patient safety (0.507).
Captures human evaluator judgment shifts, with 7.57% of ratings revised from affirmative to negative in the second round.
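
As a rough illustration of how per-dimension medians and a weighted overall score could be derived from normalized scores, here is a minimal sketch. The dimension keys, the weights, and the toy scores are all hypothetical; the actual weights used in the study are not stated on this page, so this is not the authors' scoring method.

```python
from statistics import median


def dimension_medians(scores):
    """Median per dimension; `scores` maps dimension -> list of 0..1 scores."""
    return {dim: median(values) for dim, values in scores.items()}


def weighted_score(dim_scores, weights):
    """Weighted mean of per-dimension scores, normalized by the total weight."""
    missing = set(dim_scores) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    total_weight = sum(weights[dim] for dim in dim_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(dim_scores[dim] * weights[dim] for dim in dim_scores) / total_weight


# Hypothetical toy scores for three of the ten dimensions, NOT dataset values.
scores = {
    "scenario_comprehension": [0.9, 1.0, 0.8],
    "patient_safety": [0.5, 0.4, 0.6],
    "hallucination_avoidance": [0.7, 0.7, 1.0],
}
# Hypothetical weights; patient safety is weighted double here for illustration.
weights = {
    "scenario_comprehension": 1.0,
    "patient_safety": 2.0,
    "hallucination_avoidance": 1.0,
}

medians = dimension_medians(scores)
overall = weighted_score(medians, weights)
print(medians)
print(f"weighted overall: {overall:.3f}")
```

Dividing by the total weight keeps the overall score on the same 0 to 1 scale as the inputs, whatever scale the weights use.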

Limitations

Row count and column-level documentation are not provided, which makes suitability harder to assess.
The file is small (10.6 KB), which suggests it holds summary results rather than raw evaluation data.
Data is specific to cardiac surgery scenarios; generalizability to other medical domains is not assessed.

Provenance

Source: figshare, author Marc Leon.
Collection Method: A panel of senior cardiac surgeons developed 15 scenarios; five LLMs were evaluated using a multi-agent prompting strategy, with ratings from a separate group of senior surgeons in a blinded two-phase process.
Time Range: Study published in 2026.
Freshness: Last updated 2026-05-29 06:18:53.

License is CC-BY-4.0, requiring attribution. File format is XLSX, requiring compatible software.
