Description

Senior surgeons developed fifteen high-fidelity cardiac surgery scenarios to benchmark five large language models, including O1 and GPT-4, against a weighted evaluation framework with 10 dimensions. The top model, O1, reached a median normalized score of 0.896, while patient safety and hallucination avoidance were the lowest-scoring dimensions across all models. A separate blinded evaluation by surgeons revealed a 7.57% shift in ratings from affirmative to negative after exposure to expert-curated reference answers.
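To make the idea of a weighted, normalized score concrete, the following is a minimal sketch of how such a number could be computed. It is not the authors' scoring method: the weights, the 5-point rating ceiling, and every dimension name other than patient safety and hallucination avoidance are assumptions chosen purely for illustration.

```python
from statistics import median

# Hypothetical weights. Only "patient_safety" and "hallucination_avoidance"
# are named in the description; the remaining eight names are placeholders.
WEIGHTS = {"patient_safety": 2.0, "hallucination_avoidance": 2.0}
WEIGHTS.update({f"dimension_{i}": 1.0 for i in range(3, 11)})

MAX_RATING = 5.0  # assumed rating ceiling per dimension


def normalized_score(ratings, weights=WEIGHTS, max_rating=MAX_RATING):
    """Weighted sum of ratings divided by the maximum attainable weighted sum."""
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same dimensions")
    weighted_sum = sum(ratings[name] * weights[name] for name in weights)
    return weighted_sum / (max_rating * sum(weights.values()))


def median_normalized_score(per_scenario_ratings):
    """Median of the normalized scores across scenarios for one model."""
    return median(normalized_score(r) for r in per_scenario_ratings)


# Made-up ratings for three scenarios (not values from the dataset).
example_scenarios = [
    {name: 5.0 for name in WEIGHTS},
    {name: 4.0 for name in WEIGHTS},
    {name: 3.0 for name in WEIGHTS},
]
example_median = median_normalized_score(example_scenarios)
print(round(example_median, 3))
```

Dividing by the maximum attainable weighted sum keeps the result at or below 1 regardless of how the weights are chosen, which is one way a figure such as 0.896 could be comparable across models.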

Use Cases

Benchmarking LLM performance on complex clinical reasoning tasks using expert-developed scenarios.
Studying human-AI collaboration dynamics through the two-phase blinded evaluation protocol.
Analyzing model weaknesses in critical clinical dimensions like patient safety and hallucination avoidance.
Comparing multi-agent prompting strategies for LLMs in specialized medical domains.
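For the two-phase protocol, a rating shift can be measured by pairing each rating given before exposure to the reference answers with the rating given afterwards. The sketch below assumes one plausible definition, the share of all paired ratings that moved from affirmative to negative; the dataset description does not state the denominator used for the reported 7.57%, and the function name and boolean encoding are hypothetical.

```python
def affirmative_to_negative_shift(phase_one, phase_two):
    """Fraction of paired ratings that were affirmative first and negative second.

    Each list holds booleans (True = affirmative), aligned so that index i
    refers to the same rater and the same model answer in both phases.
    """
    if len(phase_one) != len(phase_two):
        raise ValueError("both phases must contain the same number of ratings")
    if not phase_one:
        raise ValueError("at least one paired rating is required")
    flipped = sum(
        1 for before, after in zip(phase_one, phase_two) if before and not after
    )
    return flipped / len(phase_one)


# Made-up ratings, not values from the dataset.
before_reference = [True, True, False, True]
after_reference = [True, False, False, True]
shift = affirmative_to_negative_shift(before_reference, after_reference)
print(f"{shift:.2%}")
```

If the shift were instead defined relative to initially affirmative ratings only, the denominator would change to the count of `True` values in the first phase.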

Strengths

Evaluation framework is based on 15 expert-developed, high-fidelity cardiac surgery scenarios.
Performance metrics are derived from a blinded two-phase evaluation conducted by senior cardiac surgeons.
Dataset is licensed under CC-BY-4.0, facilitating open reuse.

Limitations

Column names and specific row counts are not provided, limiting precise understanding of data structure.
The dataset appears to be distributed as DOCX and XLSX files rather than as a structured data table, which may complicate direct analysis.
Multiple listings show conflicting file sizes (681348, 115375, 10494 bytes), suggesting possible versioning or format differences.
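Because column names and row counts are not documented, a first step is to inspect the files themselves. DOCX and XLSX files are both ZIP-based Office Open XML packages, so Python's standard `zipfile` module can report what each file contains before any heavier parsing. The sketch below uses a hypothetical helper, `summarize_ooxml`, and demonstrates it on a stand-in archive built in memory rather than on the actual dataset files.

```python
import io
import zipfile


def summarize_ooxml(path_or_file):
    """Report the package kind and worksheet parts of a DOCX or XLSX file."""
    with zipfile.ZipFile(path_or_file) as archive:
        names = archive.namelist()
    if any(name.startswith("xl/") for name in names):
        kind = "xlsx"
    elif any(name.startswith("word/") for name in names):
        kind = "docx"
    else:
        kind = "unknown"
    worksheets = sorted(
        name
        for name in names
        if name.startswith("xl/worksheets/") and name.endswith(".xml")
    )
    return {"kind": kind, "worksheet_parts": worksheets, "member_count": len(names)}


# Stand-in archive that mimics the layout of a small workbook.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as fake_workbook:
    fake_workbook.writestr("[Content_Types].xml", "<Types/>")
    fake_workbook.writestr("xl/workbook.xml", "<workbook/>")
    fake_workbook.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
buffer.seek(0)

summary = summarize_ooxml(buffer)
print(summary)
```

Running the same helper on each downloaded file would show whether the differing file sizes correspond to different formats, and how many worksheets each XLSX file holds.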

Provenance

Source: Marc Leon
Collection Method: Scenarios and reference answers were independently developed by a panel of senior cardiac surgeons; evaluations were conducted by a separate group of surgeons.
Freshness: Last updated 2026-05-29.

The primary data files are in DOCX and XLSX formats; the dataset likely contains textual evaluations, scores, and possibly synthetic model outputs rather than raw patient data.

Tags: Text, Tabular, Audio, Excel, Human-AI Collaboration, Benchmark, LLM Evaluation, Healthcare, Clinical Reasoning, Cardiac Surgery, Synthetic

Blinded Two-Phase Evaluation of Large Language Models in Cardiac Surgery
