Name: LLM Performance and Human-AI Collaboration in 15 Cardiac Surgery Scenarios
Creator: Marc Leon
Published: 2026-05-29T06:11:57
License: CC-BY-4.0
Keywords: Human-AI Collaboration, Benchmark, Healthcare, Clinical Evaluation, Text, Audio, Cardiac Surgery, Large Language Models, Synthetic

Description

A blinded two-phase evaluation assessed five large language models on 15 high-fidelity cardiac surgery reasoning tasks. Median normalized scores ranged from 0.521 to 0.896, with O1 achieving the highest score. The study, authored by Marc Leon and updated in May 2026, found that overacceptance of incorrect AI reasoning was a dominant collaboration imbalance.

Use Cases

Benchmarking LLM performance on complex, open-ended medical reasoning tasks, using the 15 clinical scenarios.
Studying human-AI collaboration dynamics in high-stakes settings, using the two-phase blinded evaluation protocol.
Analyzing model weaknesses in critical clinical dimensions such as patient safety and hallucination avoidance, using the 10-dimensional evaluation framework.

Strengths

Evaluation framework includes 15 high-fidelity cardiac surgery scenarios developed by senior surgeons.
Performance metrics for five LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) are provided with median normalized scores.
Analysis includes a second-round evaluation showing 7.57% of ratings were revised from affirmative to negative.

Limitations

The dataset is a 46.9 KB DOCX file, suggesting a limited scope focused on study results rather than raw evaluation data.
Row count and column-level documentation are unknown, limiting suitability assessment for direct machine learning use.
Data may reflect bias inherent to the specific scenarios and evaluator panel described in the study.

Provenance

Source: Marc Leon via figshare.
Collection Method: A panel of senior cardiac surgeons developed scenarios; a separate group conducted a blinded two-phase evaluation of LLM outputs.
Freshness: Last updated 2026-05-29 06:11:57.

Format: A DOCX document containing study results, not a structured dataset (e.g., CSV). License: CC-BY-4.0.
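
Because the results ship as a DOCX document rather than a table, any reuse starts with text extraction. The following is a minimal sketch using only the Python standard library. It assumes the file is a standard Office Open XML package with its body in word/document.xml; the function name docx_paragraphs is hypothetical and not part of the dataset.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used for elements inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the non-empty paragraphs of a DOCX file as a list of strings.

    `source` may be a file path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    # A DOCX file is a ZIP archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Calling docx_paragraphs("study_results.docx") (a placeholder filename) would return the document text paragraph by paragraph, including text inside table cells. Turning that text into per-model scores would still require manual inspection, since the column layout of the file is undocumented.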
