Name: Blinded Two-Phase Evaluation of LLMs in Complex Cardiac Surgery
Creator: Marc Leon
Published: 2026-05-29T06:11:56
License: CC-BY-4.0
Keywords: Human-AI Collaboration, Benchmark, Healthcare, Clinical Evaluation, Text, Audio, Cardiac Surgery, Large Language Models, Synthetic

Description

Five large language models were evaluated on 15 high-fidelity cardiac surgery scenarios by senior surgeons. O1 achieved the highest median normalized score (0.896), while patient safety and hallucination avoidance were the lowest-scoring dimensions across models. The dataset, authored by Marc Leon and last updated in May 2026, documents the evaluation framework and results, concluding that LLMs are not yet ready for safe use in complex surgical settings.

Use Cases

Benchmarking LLM performance on complex clinical reasoning tasks based on the 15 scenario evaluations.
Studying human-AI collaboration patterns based on the documented shifts in surgeon ratings between blinded evaluation rounds.
Identifying critical limitations in AI for surgery based on the low performance scores in patient safety and hallucination avoidance dimensions.

Strengths

Evaluation framework developed by a panel of senior cardiac surgeons, suggesting high clinical relevance.
Includes performance scores for five representative LLMs across 10 weighted evaluation dimensions.
Documents a two-phase blinded evaluation process, capturing initial and revised human judgments.
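
The card does not state how the 10 weighted dimensions are combined into a normalized score. As a rough sketch only, one common approach is a weighted mean of per-dimension ratings divided by the maximum attainable rating, with the per-model figure taken as the median across scenarios. The weights, the 0-5 rating scale, the "clarity" dimension, and all rating values below are hypothetical and are not taken from the dataset.

```python
from statistics import median


def normalized_score(ratings, weights, max_rating):
    """Weighted mean of per-dimension ratings, scaled to the range [0, 1]."""
    if ratings.keys() != weights.keys():
        raise ValueError("ratings and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    weighted = sum(weights[dim] * ratings[dim] for dim in ratings)
    return weighted / (total_weight * max_rating)


# Hypothetical weights and ratings (assumed 0-5 scale), for illustration only.
weights = {"patient_safety": 3, "hallucination_avoidance": 2, "clarity": 1}
scenario_ratings = [
    {"patient_safety": 4, "hallucination_avoidance": 3, "clarity": 5},
    {"patient_safety": 5, "hallucination_avoidance": 5, "clarity": 5},
    {"patient_safety": 2, "hallucination_avoidance": 3, "clarity": 4},
]

# One normalized score per scenario, then the median across scenarios.
scores = [normalized_score(r, weights, max_rating=5) for r in scenario_ratings]
model_median = median(scores)
```

Using the median rather than the mean keeps a single very poor or very strong scenario from dominating a model's summary figure, which matters with only 15 scenarios.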

Limitations

The dataset is a single 665.4 KB DOCX document rather than a structured data table.
Column-level documentation is absent; field semantics must be inferred from the narrative description.
Row count is unknown, which makes it hard to assess suitability for quantitative analysis.

Provenance

Source: Marc Leon
Collection Method: A panel of senior cardiac surgeons developed scenarios; a separate group conducted blinded evaluations of prompted LLMs.
Freshness: Last updated 2026-05-29 06:11:56; confirm with the source that this is still the latest version.

Data is provided as a DOCX document; users will need to parse the text to extract structured information.
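
A minimal sketch of that parsing step, using only the Python standard library: a DOCX file is a ZIP archive whose main text lives in word/document.xml, so paragraphs can be read without third-party packages. The function name docx_paragraphs is a hypothetical helper, and the internal layout of this particular document (headings, tables, score listings) is not described on the card, so any further structuring would have to be worked out after inspecting the extracted text.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a DOCX archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the non-empty paragraphs of a DOCX file as a list of strings.

    `source` may be a filesystem path or a binary file-like object.
    Paragraphs inside table cells are included in document order.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text.strip())
    return paragraphs
```

Calling docx_paragraphs on the downloaded file's path yields a flat list of text lines. Table structure is flattened by this approach, so if the scores are stored in tables, a library that exposes rows and cells may be a better fit.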
