Name: LLM Performance and Human-AI Collaboration in 15 Cardiac Surgery Scenarios
Creator: Marc Leon
Published: 2026-05-29T06:11:58
License: CC-BY-4.0
Keywords: Human-AI Collaboration, Benchmark, LLM Evaluation, Healthcare, Tabular, Audio, Medical AI, Surgical Decision Making, Cardiac Surgery, Excel, Synthetic

Description

A blinded two-phase evaluation of five large language models on 15 high-fidelity cardiac surgery reasoning tasks. The dataset contains normalized performance scores across 10 weighted evaluation dimensions, including scenario comprehension and patient safety, and records how senior surgeons revised their ratings between the two phases. It was authored by Marc Leon and last updated in May 2026.

Use Cases

Benchmarking LLM performance on complex clinical reasoning tasks based on the 10-dimensional evaluation framework.
Analyzing human-AI collaboration patterns, such as overacceptance, based on the two-phase rating revision data.
Comparing model strengths and weaknesses across specific evaluation dimensions like patient safety and hallucination avoidance.
Studying the stability of model rankings across different high-fidelity surgical scenarios.
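
As a rough illustration of the benchmarking and ranking use cases above, the sketch below combines per-dimension ratings into a weighted score and ranks models by their median score across scenarios. The weights, the function names, and the model labels are hypothetical: this card does not state the actual weights or the column layout, so treat this as one plausible reading rather than the dataset's method.

```python
from statistics import median

# Hypothetical weights for three of the ten dimensions; the real weights
# are not documented on this page and must be taken from the dataset.
WEIGHTS = {
    "scenario_comprehension": 2.0,
    "patient_safety": 3.0,
    "hallucination_avoidance": 1.0,
}


def weighted_score(ratings, weights=WEIGHTS):
    """Weighted mean of per-dimension ratings already normalized to [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[dim] * ratings[dim] for dim in weights) / total_weight


def median_by_model(scores):
    """Map {model: [score per scenario, ...]} to {model: median score}."""
    return {model: median(values) for model, values in scores.items()}


def rank_models(scores):
    """Return model names ordered from highest to lowest median score."""
    medians = median_by_model(scores)
    return sorted(medians, key=medians.get, reverse=True)
```

Ranking on the median rather than the mean keeps a single unusual scenario from dominating, which matters when there are only 15 scenarios per model.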

Strengths

Includes performance data for five specific LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) with median normalized scores.
Evaluation is based on 15 expert-developed cardiac surgery scenarios and a 10-dimensional weighted framework.
Captures human evaluator judgment shifts between two blinded rating phases, with 7.57% of ratings revised from affirmative to negative.
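
A figure like the revision rate above could be recomputed from paired phase-one and phase-two ratings along the following lines. This is a minimal sketch: it assumes each rating is stored as a boolean (affirmative or negative) and that the denominator is all ratings rather than only the initially affirmative ones. Neither assumption is confirmed by this card, and `revision_rate` is a hypothetical helper name.

```python
def revision_rate(pairs, before=True, after=False):
    """Share of all ratings that changed from `before` (phase 1) to `after` (phase 2).

    `pairs` is an iterable of (phase1_rating, phase2_rating) tuples.
    Assumption: the denominator is every rating, not only those that
    started as `before`.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no ratings supplied")
    flipped = sum(1 for first, second in pairs if first == before and second == after)
    return flipped / len(pairs)
```

Swapping the `before` and `after` arguments gives the opposite direction (negative revised to affirmative), which is useful for checking whether revisions were symmetric.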

Limitations

Row count and column-level documentation are unknown; field semantics must be inferred after download.
The dataset is very small (10.2 KB), which suggests it contains summary results rather than raw evaluation data.
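
Because the layout is undocumented, a reasonable first step after download is to list the worksheets before loading anything. The sketch below does this with the Python standard library only. It assumes the file is an `.xlsx` workbook (suggested by the Excel keyword, but not confirmed here), and `list_sheet_names` is a hypothetical helper name.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the workbook part of an .xlsx (Office Open XML) file.
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def list_sheet_names(xlsx_file):
    """Return worksheet names from an .xlsx workbook.

    `xlsx_file` may be a filesystem path or a binary file-like object.
    Assumption: the download is an .xlsx file, which is a zip archive
    holding the sheet list in xl/workbook.xml.
    """
    with zipfile.ZipFile(xlsx_file) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [sheet.get("name") for sheet in root.iter(f"{MAIN_NS}sheet")]
```

Once the sheet names are known, each sheet can be opened in a spreadsheet tool or a dataframe library to infer what the columns mean.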

Provenance

Source: figshare
Collection Method: A panel of senior cardiac surgeons developed scenarios; a separate group conducted a blinded two-phase evaluation of LLM outputs.
Freshness: Last updated 2026-05-29 06:11:58

License is CC-BY-4.0, requiring attribution.
