Fifteen high-fidelity cardiac surgery scenarios were used to evaluate five large language models on a 10-dimensional weighted framework. Median normalized scores ranged from 0.521 for Llama3-OpenBioLLM-70B to 0.896 for O1; across dimensions, scenario comprehension scored highest and patient safety lowest. The dataset, created by Marc Leon and published on figshare in 2026, captures a blinded two-phase evaluation in which surgeons revised 7.57% of ratings from affirmative to negative after seeing reference answers.
The dataset is licensed under CC-BY-4.0, which requires attribution.