A 2026 study by Marc Leon presents a blinded, two-phase evaluation of five large language models on 15 high-fidelity cardiac surgery scenarios. The dataset contains normalized performance scores for O1, O3-mini-high, DeepSeek-R1, GPT-4, and Llama3-OpenBioLLM-70B across 10 weighted evaluation dimensions. The results show that performance varies across models and highlight an imbalance in clinician-model collaboration: clinicians over-accepted incorrect model reasoning.
The dataset is licensed under CC-BY-4.0, which requires attribution.