Available on 1 platform
This dataset records a blinded, two-phase evaluation of five large language models on 15 high-fidelity cardiac surgery reasoning tasks. It contains normalized performance scores across 10 weighted evaluation dimensions, including scenario comprehension and patient safety, and tracks rating revisions made by senior surgeons. It was authored by Marc Leon and last updated in May 2026.
The dataset is licensed under CC-BY-4.0, which requires attribution.