Name: LLM Performance and Human-AI Collaboration in 15 Cardiac Surgery Scenarios
Creator: Marc Leon
Published: 2026-05-29T06:11:57
License: CC-BY-4.0
Keywords: Human AI Collaboration, Benchmark, LLM Evaluation, Healthcare, Tabular, Audio, Clinical Reasoning, Medical AI, Cardiac Surgery, Excel, Synthetic

Description

Fifteen high-fidelity cardiac surgery scenarios were used to evaluate five large language models on a 10-dimensional weighted framework. Median normalized scores ranged from 0.521 for Llama3-OpenBioLLM-70B to 0.896 for O1, with scenario comprehension scoring highest and patient safety lowest. The dataset, created by Marc Leon and published on figshare in 2026, captures a blinded two-phase evaluation in which surgeons revised 7.57% of ratings from affirmative to negative after seeing reference answers.

Use Cases

Benchmarking LLM performance on complex, open-ended clinical reasoning tasks based on the 10-dimensional evaluation framework.
Analyzing patterns of human overacceptance of AI-generated reasoning based on the two-phase evaluation results.
Comparing task-specific performance across different LLM architectures (e.g., O1, GPT-4, DeepSeek-R1) for surgical decision-making.
Studying collaboration imbalances in human-AI teams based on the shift in surgeon ratings between evaluation rounds.
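
As a rough illustration of the overacceptance analysis mentioned above, the sketch below computes the share of ratings that flip from affirmative to negative between two evaluation phases. The dataset's column layout is undocumented, so the input shape (pairs of booleans), the function name, and the example ratings are all hypothetical, and the choice of all ratings as the denominator is an assumption about how the 7.57% figure was derived.

```python
from typing import Iterable, Tuple


def affirmative_to_negative_rate(pairs: Iterable[Tuple[bool, bool]]) -> float:
    """Share of all ratings that were affirmative in phase 1 and negative in phase 2.

    Each pair is (phase_1_affirmative, phase_2_affirmative). This layout is a
    hypothetical stand-in; the real file's columns must be inspected first.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no ratings supplied")
    # Count only True -> False transitions; other changes are ignored here.
    flipped = sum(1 for first, second in pairs if first and not second)
    # Assumption: the denominator is every rating, not only affirmative ones.
    return flipped / len(pairs)


# Made-up ratings for illustration only; these are not values from the dataset.
example_pairs = [(True, True), (True, False), (False, False), (True, True)]
example_rate = affirmative_to_negative_rate(example_pairs)
```

If the published figure was instead computed over initially affirmative ratings only, the denominator would need to change accordingly.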

Strengths

Evaluation is based on 15 independently developed, high-fidelity cardiac surgery scenarios.
Includes performance scores for five representative LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) across 10 weighted dimensions.
Captures human evaluator judgment shifts, with 7.57% of ratings revised from affirmative to negative in the second round.
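
One plausible way to turn ratings on 10 weighted dimensions into a normalized score, and then into a per-model median across scenarios, is sketched below. The source does not state the weights, the rating scale, or the normalization formula, so the weighted-mean formula, the 0 to 5 scale, and every number in the example are assumptions made for illustration.

```python
from statistics import median
from typing import Sequence


def normalized_weighted_score(
    scores: Sequence[float],
    weights: Sequence[float],
    max_score: float,
) -> float:
    """Weighted mean of dimension scores, scaled to the range 0..1.

    Assumed formula: sum(w_i * s_i) / (max_score * sum(w_i)).
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0 or max_score <= 0:
        raise ValueError("total weight and max_score must be positive")
    weighted_sum = sum(s * w for s, w in zip(scores, weights))
    return weighted_sum / (max_score * total_weight)


# Hypothetical ratings for one scenario on a 0-5 scale across 10 dimensions.
example_scores = [5, 4, 3, 5, 4, 2, 5, 3, 4, 5]
# Hypothetical weights; the real framework's weights are not documented here.
example_weights = [2, 1, 1, 1, 1, 2, 1, 1, 1, 1]
example_normalized = normalized_weighted_score(example_scores, example_weights, 5)

# A model's summary score would then be the median over its scenario scores.
example_median = median([example_normalized, 0.5, 0.9])
```

The median is used for the summary because the description reports median normalized scores per model.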

Limitations

Row count and column-level documentation are unknown, which limits suitability assessment and requires field semantics to be inferred after download.
The dataset is very small at 10.7 KB, indicating a limited scope focused on evaluation results rather than raw interaction data.
Data may reflect bias inherent to the specific scenarios and expert panel used in the study.

Provenance

Source: figshare
Collection Method: A panel of senior cardiac surgeons developed scenarios; a separate group conducted a blinded two-phase evaluation of LLM responses.
Freshness: Last updated 2026-05-29 06:11:57; freshness should be verified.

License is CC-BY-4.0, requiring attribution.
