Blinded Two-Phase Evaluation of LLMs in Cardiac Surgery Scenarios
by Marc Leon · Updated 7d ago
46.9 KB · 1 file
Available on 1 platform
Description
Marc Leon's dataset contains evaluation data from a study assessing large language model performance and human-AI collaboration in cardiac surgery. The dataset includes expert ratings from a blinded two-phase evaluation of five LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, and Llama3-OpenBioLLM-70B) across 15 high-fidelity surgical scenarios. It supports analysis of model rankings, performance across 10 weighted evaluation dimensions, and shifts in expert judgment after reviewing reference answers.
Use Cases
Benchmarking LLM performance on complex clinical reasoning tasks, using the expert-curated scenarios.
Analyzing human-AI collaboration patterns, using the ratings from the two-phase blinded evaluation.
Studying how evaluator judgments shift in medical contexts, using the optional rating revisions.
Comparing model strengths across evaluation dimensions such as patient safety and hallucination avoidance, using the 10-dimension framework.
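As a rough illustration of the benchmarking and dimension-comparison use cases, the sketch below computes a weighted, normalized score per rating and then a median per model. Everything specific in it is an assumption: the listing does not give the dimension names, the weights, the rating scale, or the normalization formula, so `WEIGHTS`, `MAX_SCORE`, the row layout, and the toy values are hypothetical and would need to be replaced with the dataset's actual structure.

```python
from statistics import median

# Hypothetical weights: the real framework has 10 weighted dimensions whose
# names and weights are not given in this listing.
WEIGHTS = {"patient_safety": 3.0, "hallucination_avoidance": 2.0, "clarity": 1.0}
MAX_SCORE = 5  # assumed top of the rating scale


def normalized_score(ratings):
    """Weighted sum of dimension scores, scaled to the 0..1 range."""
    total = sum(WEIGHTS[dim] * ratings[dim] for dim in WEIGHTS)
    return total / (MAX_SCORE * sum(WEIGHTS.values()))


def median_by_model(rows):
    """Group per-rating normalized scores by model and take each median."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["model"], []).append(normalized_score(row["ratings"]))
    return {model: median(scores) for model, scores in grouped.items()}


# Toy rows for illustration only; these are not values from the dataset.
rows = [
    {"model": "model_a",
     "ratings": {"patient_safety": 5, "hallucination_avoidance": 5, "clarity": 5}},
    {"model": "model_a",
     "ratings": {"patient_safety": 4, "hallucination_avoidance": 3, "clarity": 5}},
    {"model": "model_b",
     "ratings": {"patient_safety": 0, "hallucination_avoidance": 0, "clarity": 0}},
]

medians = median_by_model(rows)
# Rank models from highest to lowest median normalized score.
ranking = sorted(medians, key=medians.get, reverse=True)
```

The median is used here because the listing reports a median normalized score; a mean or a per-scenario aggregate could be swapped in if the study defines its ranking differently.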
Strengths
Scenarios are high-fidelity and were developed by a panel of senior cardiac surgeons.
Includes specific performance metrics for five named LLMs, such as O1's median normalized score of 0.896.
Captures expert judgment shifts, with 7.57% of ratings revised from affirmative to negative.
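A figure like the revision percentage above could be recomputed from paired phase-one and phase-two ratings along the following lines. This is a minimal sketch under stated assumptions: the pair layout, the `"yes"`/`"no"` labels, and the choice of all ratings as the denominator are hypothetical, since the listing does not say how ratings are encoded or what the percentage is measured against.

```python
def revision_rate(pairs, before, after):
    """Share of all ratings that changed from `before` to `after`.

    `pairs` is a list of (phase_one, phase_two) ratings. The denominator is
    every rating, which is an assumption about how the study reports it.
    """
    if not pairs:
        raise ValueError("no ratings supplied")
    changed = sum(1 for first, second in pairs if first == before and second == after)
    return changed / len(pairs)


# Toy data for illustration only; these are not values from the dataset.
pairs = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "yes")]
rate = revision_rate(pairs, before="yes", after="no")
```

If the study instead measures revisions against only the initially affirmative ratings, the denominator would change to the count of pairs whose first element equals `before`.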
Limitations
Column names and a precise row count are not provided, limiting understanding of the data structure.
The dataset size is reported inconsistently across platform entries (48073 and 11720, with no units stated; 48073 bytes would correspond to the 46.9 KB shown in the listing).
The scenarios are synthetic, constructed for an evaluation study, and may not represent real patient cases.
Provenance
Source
Marc Leon
Collection Method
A panel of senior cardiac surgeons developed 15 scenarios; a separate group of senior surgeons conducted blinded evaluations.
Freshness
Last updated 2026-05-29.
License is CC-BY-4.0. The dataset appears to be an Excel file containing evaluation results.