Name: LLM Performance and Human-AI Collaboration in 15 Cardiac Surgery Scenarios
Creator: Marc Leon
Published: 2026-05-29T06:11:57
License: CC-BY-4.0
Keywords: Human-AI Collaboration, Benchmark, LLM Evaluation, Healthcare, Tabular, Audio, Clinical Reasoning, Medical AI, Cardiac Surgery, Excel, Synthetic

Description

Five large language models were evaluated on 15 high-fidelity cardiac surgery scenarios by senior surgeons using a 10-dimensional weighted framework. Median normalized scores ranged from 0.521 for Llama3-OpenBioLLM-70B to 0.896 for O1, with scenario comprehension scoring highest and patient safety lowest. The dataset, created by Marc Leon and last updated in May 2026, captures model performance and evaluator judgment shifts in a blinded two-phase study.
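
To make the scoring terms above concrete, here is a minimal sketch of how a weighted, normalized score and its median across scenarios could be computed. It is an illustration only, not the study's method: the weights, the 1 to 5 rating scale, and the `other_dimension` placeholder are assumptions, and only three of the ten dimensions are shown for brevity.

```python
from statistics import median

# Hypothetical weights. The dataset's actual dimensions and weights are not
# documented in this card; only the first two names appear in the description.
WEIGHTS = {
    "scenario_comprehension": 1.0,
    "patient_safety": 2.0,
    "other_dimension": 1.0,  # placeholder for the remaining dimensions
}
MIN_RATING, MAX_RATING = 1, 5  # assumed rating scale


def normalized_score(ratings, weights=WEIGHTS):
    """Weighted mean of the ratings, rescaled to the range 0..1."""
    total_weight = sum(weights.values())
    weighted = sum(weights[d] * ratings[d] for d in weights) / total_weight
    return (weighted - MIN_RATING) / (MAX_RATING - MIN_RATING)


def median_score(per_scenario_ratings, weights=WEIGHTS):
    """Median normalized score across scenarios for one model."""
    return median(normalized_score(r, weights) for r in per_scenario_ratings)
```

Under these assumptions, a model rated at the top of the scale on every dimension scores 1.0 and one rated at the bottom scores 0.0, so the reported medians of 0.521 and 0.896 would sit on that same 0 to 1 range.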

Use Cases

Benchmark LLM performance on complex surgical reasoning tasks using the 10-dimensional evaluation framework.
Analyze human-AI collaboration patterns through the documented rating shifts between the blinded and unblinded evaluation phases.
Compare the clinical reasoning capabilities of different LLM architectures using the scenario-specific performance scores.
Study overacceptance of AI-generated reasoning in clinical settings, drawing on the reported imbalance in human-AI collaboration.
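
For the rating-shift use case, a minimal sketch of how paired ratings from the two phases might be compared is given below. The function name and the input layout (two equal-length lists of numeric ratings for the same items) are hypothetical; the actual file layout is undocumented and may need reshaping first.

```python
def summarize_shifts(first_phase, second_phase):
    """Compare paired ratings from two evaluation phases.

    Both arguments are lists of numeric ratings for the same items in the
    same order (an assumed layout, not the dataset's documented schema).
    """
    if len(first_phase) != len(second_phase):
        raise ValueError("phases must contain the same number of ratings")
    if not first_phase:
        raise ValueError("no ratings to compare")
    # Positive shift: the rating was revised upward in the second phase.
    shifts = [b - a for a, b in zip(first_phase, second_phase)]
    return {
        "up": sum(1 for s in shifts if s > 0),
        "down": sum(1 for s in shifts if s < 0),
        "unchanged": sum(1 for s in shifts if s == 0),
        "mean_shift": sum(shifts) / len(shifts),
    }
```

A marked imbalance between the "up" and "down" counts is one simple signal that could be examined when studying overacceptance, though it is not by itself evidence of it.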

Strengths

Evaluation is based on 15 high-fidelity cardiac surgery scenarios developed by a panel of senior cardiac surgeons.
Performance is measured using a 10-dimensional weighted evaluation framework, providing multi-faceted scores.
Captures human-AI collaboration dynamics through a two-phase blinded evaluation with documented rating revisions.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, so the dataset's suitability is hard to assess before download.
The dataset is very small at 10.6 KB, indicating limited scope.
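
Because column semantics have to be inferred after download, a small profiling pass can help. The sketch below assumes the sheet has already been loaded into a list of dicts, one per row (for example with a spreadsheet library such as pandas or openpyxl); the helper name and the output shape are hypothetical.

```python
def profile_columns(rows):
    """Summarize each column of a table given as a list of dicts."""
    profile = {}
    for row in rows:
        for name, value in row.items():
            info = profile.setdefault(
                name, {"types": set(), "missing": 0, "examples": []}
            )
            # Treat None, empty strings, and NaN as missing values.
            is_nan = isinstance(value, float) and value != value
            if value is None or value == "" or is_nan:
                info["missing"] += 1
                continue
            info["types"].add(type(value).__name__)
            # Keep up to three distinct example values per column.
            if len(info["examples"]) < 3 and value not in info["examples"]:
                info["examples"].append(value)
    return profile
```

The observed types, missing counts, and example values per column give a starting point for guessing which fields hold model names, scenario identifiers, dimension scores, or phase labels.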

Provenance

Source: figshare
Collection Method: A panel of senior cardiac surgeons developed scenarios; a separate group conducted a blinded two-phase evaluation of prompted LLMs.
Freshness: Last updated 2026-05-29 06:11:57; verify against the source before use.

License is CC-BY-4.0, requiring attribution.
