Name: LLM Performance and Human-AI Collaboration in 15 Cardiac Surgery Scenarios
Creator: Marc Leon
Published: 2026-05-29T06:18:53
License: CC-BY-4.0
Keywords: Human-AI Collaboration, Benchmark, LLM Evaluation, Healthcare, Tabular, Audio, Clinical Reasoning, Medical AI, Cardiac Surgery, Excel, Synthetic

Description

A 2026 study by Marc Leon presents a two-phase evaluation of five large language models (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) on 15 expert-curated cardiac surgery scenarios. The dataset contains normalized performance scores across 10 weighted evaluation dimensions, including scenario comprehension, patient safety, and hallucination avoidance. It also documents rating shifts from a blinded evaluation by senior surgeons, revealing patterns of human-AI collaboration.
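To illustrate how scores across weighted dimensions might be combined into a single figure, here is a minimal sketch. It assumes each dimension score is normalized to the 0..1 range and that the aggregate is a weighted mean; the source does not state the actual weights or the aggregation rule, so the function name, the weights, and the score values below are all hypothetical.

```python
from typing import Mapping


def weighted_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Combine per-dimension scores into one weighted mean.

    Assumption: scores are normalized to 0..1 and weights are non-negative.
    The study's real weighting scheme is not documented in this card.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Hypothetical weights and scores for three of the ten dimensions.
example_weights = {
    "scenario_comprehension": 2.0,
    "patient_safety": 3.0,
    "hallucination_avoidance": 1.0,
}
example_scores = {
    "scenario_comprehension": 0.5,
    "patient_safety": 1.0,
    "hallucination_avoidance": 0.5,
}
print(weighted_score(example_scores, example_weights))  # 0.75
```

Dividing by the sum of the weights keeps the result on the same 0..1 scale as the inputs, whatever units the weights are expressed in.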

Use Cases

Benchmarking LLM performance on complex, open-ended clinical reasoning tasks based on the 15 cardiac surgery scenarios.
Analyzing human-AI collaboration patterns using the documented rating shifts between the two phases of the blinded evaluation.
Comparing model strengths and weaknesses across specific evaluation dimensions, such as patient safety and hallucination avoidance.
Studying 'overacceptance', in which clinicians may over-trust incorrect AI-generated reasoning.

Strengths

Evaluation is based on 15 high-fidelity cardiac surgery scenarios developed independently by senior cardiac surgeons.
Includes performance scores for five specific LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) across 10 weighted dimensions.
Documents a two-phase blinded evaluation process in which 7.57% of ratings were revised from affirmative to negative in the second round.
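A share of ratings revised from affirmative to negative, like the one reported above, could be derived from paired ratings roughly as follows. This is a sketch under stated assumptions: each rating is treated as a boolean (True for affirmative), the denominator is taken to be all rating pairs, and the function name and sample data are hypothetical. The card does not say which denominator the study used.

```python
from typing import Iterable, Tuple


def affirmative_to_negative_share(pairs: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of rating pairs that were affirmative (True) in phase one
    and negative (False) in phase two.

    Assumption: the denominator is the total number of rating pairs.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no rating pairs supplied")
    revised = sum(1 for first, second in pairs if first and not second)
    return revised / len(pairs)


# Invented (phase_one, phase_two) ratings, for illustration only.
example_pairs = [(True, True), (True, False), (False, False), (True, True)]
print(affirmative_to_negative_share(example_pairs))  # 0.25
```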

Limitations

Dataset is very small (10.7 KB), and neither the row count nor the column layout is stated.
Column-level documentation is absent; field semantics must be inferred after download.
Data reflects a specific, controlled evaluation study; its generalizability to other clinical contexts may be limited.
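Because the fields are undocumented, a first pass after download might simply profile each column. The sketch below assumes the spreadsheet has been exported to CSV text and uses only the Python standard library; the column names and values in the sample are invented for illustration and are not taken from the actual file.

```python
import csv
import io
from typing import Dict


def _is_number(text: str) -> bool:
    """Return True if the cell text parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def profile_columns(csv_text: str) -> Dict[str, Dict[str, object]]:
    """Summarize each column: non-empty cell count, whether every
    non-empty cell is numeric, and the number of distinct values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    profile: Dict[str, Dict[str, object]] = {}
    for name in reader.fieldnames or []:
        # Short rows yield None for missing cells; treat them as empty.
        values = [row[name] for row in rows if row[name] not in (None, "")]
        profile[name] = {
            "non_empty": len(values),
            "numeric": bool(values) and all(_is_number(v) for v in values),
            "distinct": len(set(values)),
        }
    return profile


# Hypothetical export; the real column names are unknown.
sample_csv = "model,dimension,score\nO1,patient_safety,0.9\nGPT-4,patient_safety,\n"
sample_profile = profile_columns(sample_csv)
for column, summary in sample_profile.items():
    print(column, summary)
```

A summary like this only narrows down what each field might mean; the actual semantics still have to be confirmed against the study itself.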

Provenance

Source: figshare, authored by Marc Leon.
Collection Method: Data was generated from a blinded two-phase evaluation framework where senior surgeons rated LLM responses to 15 expert-curated scenarios.
Time Range: Study published in 2026.
Freshness: Last updated 2026-05-29 06:18:53.

License is CC-BY-4.0, requiring attribution.
