Five large language models were evaluated on 15 high-fidelity cardiac surgery scenarios by senior surgeons using a 10-dimensional weighted framework. Median normalized scores ranged from 0.521 for Llama3-OpenBioLLM-70B to 0.896 for O1, with scenario comprehension scoring highest and patient safety lowest. The dataset, created by Marc Leon and last updated in May 2026, captures model performance and evaluator judgment shifts in a blinded two-phase study.
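As a rough illustration of how a weighted, normalized score over 10 dimensions and a per-model median across scenarios could be computed, here is a minimal Python sketch. The weights, the 0 to 5 rating scale, and the function names are assumptions made for illustration only; the summary above does not specify the study's actual weights, scale, or aggregation method.

```python
from statistics import median

# Hypothetical weights for the 10 dimensions (NOT taken from the dataset).
WEIGHTS = [0.15, 0.15, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.05, 0.05]

# Assumed rating scale: each dimension is rated from 0 to MAX_RATING.
MAX_RATING = 5


def normalized_score(ratings, weights=WEIGHTS, max_rating=MAX_RATING):
    """Return the weighted mean of the ratings, rescaled to the range [0, 1]."""
    if len(ratings) != len(weights):
        raise ValueError("expected one rating per dimension")
    weighted_sum = sum(w * r for w, r in zip(weights, ratings))
    return weighted_sum / (max_rating * sum(weights))


def median_score(per_scenario_ratings):
    """Return the median normalized score across a model's scenarios."""
    return median(normalized_score(r) for r in per_scenario_ratings)
```

Under these assumptions, a model rated at the maximum on every dimension scores 1.0 for that scenario, and the per-model figure is the median of its scenario scores.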
The dataset is licensed under CC-BY-4.0, which requires attribution.