Name: LLM Performance and Human-AI Collaboration in 15 Cardiac Surgery Scenarios
Creator: Marc Leon
Published: 2026-05-29T06:11:59
License: CC-BY-4.0
Keywords: Human-AI Collaboration, Benchmark, LLM Evaluation, Healthcare, Tabular, Audio, Medical AI, Clinical Decision Making, Cardiac Surgery, Excel, Synthetic

Description

A 2026 study by Marc Leon presents a blinded, two-phase evaluation of five large language models on 15 high-fidelity cardiac surgery scenarios. The dataset contains normalized performance scores for O1, O3-mini-high, DeepSeek-R1, GPT-4, and Llama3-OpenBioLLM-70B across 10 weighted evaluation dimensions. Results show that performance varies across models and point to a collaboration imbalance in which clinicians over-accepted incorrect model reasoning.

Use Cases

Benchmarking LLM performance on complex, open-ended medical reasoning tasks based on the 15 clinical scenarios.
Analyzing human-AI collaboration patterns based on the two-phase evaluation framework and rating revision data.
Studying model weaknesses in critical clinical dimensions, such as patient safety and hallucination avoidance, using the 10-dimensional evaluation scores.
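
As a rough illustration of how a weighted, normalized score over evaluation dimensions might be computed and summarized per model, here is a minimal sketch. The dimension names, weights, 1-to-5 rating scale, and sample ratings are all hypothetical; the dataset's actual columns and weighting scheme are not documented in this listing and should be checked after download.

```python
from statistics import median

# Hypothetical weights for three illustrative dimensions (the dataset uses 10).
WEIGHTS = {"patient_safety": 3.0, "hallucination_avoidance": 2.0, "clarity": 1.0}


def normalized_score(ratings, weights=WEIGHTS, lo=1, hi=5):
    """Weighted mean of the ratings, rescaled from [lo, hi] to [0, 1]."""
    total_weight = sum(weights.values())
    weighted = sum(weights[d] * ratings[d] for d in weights)
    return (weighted / total_weight - lo) / (hi - lo)


def median_score(per_scenario_ratings):
    """Median of the normalized scores across scenarios for one model."""
    return median(normalized_score(r) for r in per_scenario_ratings)


# Made-up ratings for one model on three scenarios (assumed 1-to-5 scale).
example = [
    {"patient_safety": 5, "hallucination_avoidance": 5, "clarity": 5},
    {"patient_safety": 3, "hallucination_avoidance": 3, "clarity": 3},
    {"patient_safety": 1, "hallucination_avoidance": 1, "clarity": 1},
]
print(median_score(example))
```

The median is used here only because the listing reports median normalized scores; whether the study aggregates per scenario, per rater, or both is an open question.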

Strengths

Evaluation framework developed by a panel of senior cardiac surgeons for 15 high-fidelity scenarios.
Contains specific median normalized scores for five representative LLMs, with O1 scoring 0.896.
Documents a 7.57% shift in ratings from affirmative to negative in the second evaluation round, which suggests that evaluators initially over-accepted model outputs.
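
The share of ratings that move from affirmative to negative between two rounds can be tallied along the following lines. This is a sketch only: the label values ("yes"/"no"), the pairing of first-round and second-round ratings by position, and the use of all paired ratings as the denominator are assumptions, since the listing does not state how the 7.57% figure was computed.

```python
def affirmative_to_negative_share(round1, round2):
    """Fraction of paired ratings that were 'yes' in round 1 and 'no' in round 2."""
    if len(round1) != len(round2):
        raise ValueError("rounds must contain the same number of paired ratings")
    if not round1:
        raise ValueError("no ratings supplied")
    flipped = sum(1 for a, b in zip(round1, round2) if a == "yes" and b == "no")
    return flipped / len(round1)


# Made-up paired ratings for illustration; not taken from the dataset.
r1 = ["yes", "yes", "no", "yes"]
r2 = ["yes", "no", "no", "yes"]
share = affirmative_to_negative_share(r1, r2)
print(f"{share:.2%}")
```

A different denominator, for example only the ratings that were affirmative in the first round, would give a different percentage, so the definition should be confirmed against the study before comparing numbers.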

Limitations

Dataset is very small at 10.4 KB; the scope is limited to 15 specific scenarios.
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not documented, so suitability for large-scale analysis cannot be assessed before download.

Provenance

Source: Marc Leon, published on figshare.
Collection Method: A panel of senior cardiac surgeons developed scenarios; a separate group conducted a blinded two-phase evaluation of prompted LLMs.
Freshness: Last updated 2026-05-29 06:11:59; check the source for newer versions before use.

License is CC-BY-4.0, requiring attribution.
