A blinded two-phase evaluation assessed five large language models on 15 high-fidelity cardiac surgery reasoning tasks. Median normalized scores ranged from 0.521 to 0.896, with O1 achieving the highest score. The study, authored by Marc Leon and updated in May 2026, found that overacceptance of incorrect AI reasoning was a dominant collaboration imbalance.
The data are provided as a DOCX document containing the study results rather than as a structured dataset (e.g., CSV). The license is CC-BY-4.0.