Name: Blinded Two-Phase Evaluation of Large Language Models in Cardiac Surgery
Creator: Marc Leon
Published: 2026-05-29T06:18:52
License: CC-BY-4.0
Keywords: Human-AI Collaboration, Benchmark, Healthcare, Text, Tabular, Audio, Blinded Evaluation, Clinical Reasoning, Cardiac Surgery, Large Language Models, Excel, Synthetic

Description

Fifteen high-fidelity cardiac surgery scenarios were used to evaluate five large language models via a blinded, two-phase framework involving senior surgeons. Median normalized scores across models ranged from 0.521 to 0.896, with scenario comprehension scoring highest and patient safety scoring lowest. The evaluation revealed a judgment shift, with 7.57% of ratings revised from affirmative to negative after surgeons reviewed expert reference answers.
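The judgment-shift figure above can be illustrated with a minimal sketch. It assumes each rating is a yes/no judgment recorded once before and once after the evaluator saw the reference answer, and that the percentage is taken over all ratings rather than over affirmative ratings only; the dataset card states neither. The function name and the toy ratings are hypothetical and are not taken from the dataset.

```python
from typing import Sequence


def judgment_shift_rate(phase1: Sequence[bool], phase2: Sequence[bool]) -> float:
    """Fraction of all ratings that moved from affirmative (True) in phase 1
    to negative (False) in phase 2.

    Assumption: the denominator is the total number of ratings.
    """
    if len(phase1) != len(phase2):
        raise ValueError("phase1 and phase2 must have the same length")
    if not phase1:
        raise ValueError("at least one rating is required")
    revised = sum(1 for before, after in zip(phase1, phase2) if before and not after)
    return revised / len(phase1)


# Toy, made-up ratings (not from the dataset): one of four ratings is revised.
phase1 = [True, True, False, True]
phase2 = [True, False, False, True]
rate = judgment_shift_rate(phase1, phase2)
print(f"{rate:.2%}")
```

If the percentage was instead computed over affirmative phase-one ratings only, the denominator would be `sum(phase1)` rather than `len(phase1)`.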

Use Cases

Benchmarking LLM performance on complex clinical reasoning tasks using the scenario-specific scores.
Studying human-AI collaboration dynamics using the two-phase blinded evaluation protocol.
Analyzing model weaknesses in critical dimensions such as patient safety and hallucination avoidance using the 10-dimensional evaluation framework.
Comparing the reasoning outputs of specific LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, Llama3-OpenBioLLM-70B) on surgical decision-making.

Strengths

Evaluation framework was developed and scored by a panel of senior cardiac surgeons, providing expert validation.
The two-phase blinded design allows for analysis of how expert reference answers influence human evaluator judgments.
Performance is assessed across a weighted 10-dimensional framework, including specific metrics like scenario comprehension and patient safety.
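As a rough illustration of how a weighted, normalized score over several dimensions might be computed, here is a minimal sketch. The weights, the 0 to 5 rating scale, and all names are assumptions made for the example; the card does not give the actual weights, scale, or aggregation formula.

```python
from statistics import median
from typing import Mapping


def weighted_normalized_score(
    scores: Mapping[str, float],
    weights: Mapping[str, float],
    max_score: float,
) -> float:
    """Weighted mean of per-dimension scores, rescaled to the range 0..1.

    Assumption: every dimension is rated on the same 0..max_score scale.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0 or max_score <= 0:
        raise ValueError("total weight and max_score must be positive")
    weighted_sum = sum(weights[dim] * scores[dim] for dim in scores)
    return weighted_sum / (max_score * total_weight)


# Hypothetical weights and ratings for two of the ten dimensions.
weights = {"scenario_comprehension": 1.0, "patient_safety": 3.0}
scenario_a = {"scenario_comprehension": 4, "patient_safety": 2}
scenario_b = {"scenario_comprehension": 5, "patient_safety": 5}

score_a = weighted_normalized_score(scenario_a, weights, max_score=5)
score_b = weighted_normalized_score(scenario_b, weights, max_score=5)

# Median across scenarios, mirroring the median normalized scores the card reports.
median_score = median([score_a, score_b])
```

Normalizing by `max_score * total_weight` keeps every score between 0 and 1, so models can be compared even when the dimensions carry different weights.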

Limitations

Column names and a precise row count are not provided, limiting understanding of the dataset's structure.
Reported file sizes conflict (48,073 vs. 11,720 bytes across entries), which suggests inconsistent metadata.
The dataset is limited to 15 specific cardiac surgery scenarios, which may not generalize to other medical specialties.

Provenance

Source: Marc Leon.
Collection Method: A panel of senior cardiac surgeons independently developed scenarios and tasks; a separate group conducted the blinded evaluation.
Freshness: Last updated 2026-05-29.

License is CC-BY-4.0. Primary data files are in DOCX and XLSX formats.
