Available on 1 platform
Marc Leon's dataset contains a two-phase evaluation of five large language models (O1, O3-mini-high, DeepSeek-R1, GPT-4, and Llama3-OpenBioLLM-70B) on 15 high-fidelity cardiac surgery scenarios. The data includes expert-curated reference answers and a 10-dimensional weighted evaluation framework, and it captures both model performance and the shifts in surgeon ratings after the surgeons saw the reference answers. The dataset is designed to assess LLM capabilities and human-AI collaboration in complex surgical decision-making.
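To make the idea of a weighted multi-dimensional score concrete, the following is a minimal sketch of how per-dimension ratings could be combined into one number and how a rating shift between the two phases could be computed. The dimension names, weights, and rating values are hypothetical placeholders; the dataset's actual dimensions, weights, and aggregation rule are not stated here and may differ.

```python
def weighted_score(ratings, weights):
    """Weighted mean of per-dimension ratings.

    Both arguments map a dimension name to a number. The names and
    weights used with this function are assumptions for illustration,
    not values taken from the dataset.
    """
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(ratings[d] * weights[d] for d in weights) / total_weight


def rating_shift(phase1, phase2, weights):
    """Change in weighted score from phase 1 to phase 2 (positive = higher)."""
    return weighted_score(phase2, weights) - weighted_score(phase1, weights)


# Hypothetical two-dimension example (the real framework has ten dimensions).
example_weights = {"accuracy": 3, "safety": 1}
before = {"accuracy": 4, "safety": 2}   # ratings before seeing the reference answer
after = {"accuracy": 3, "safety": 2}    # ratings after seeing the reference answer
shift = rating_shift(before, after, example_weights)
```

Normalizing by the sum of the weights keeps the result on the same scale as the individual ratings, which makes scores comparable even if the weights do not sum to one.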
The primary file is an Excel spreadsheet ('Table 7_Blinded two-phase evaluation...xlsx'), which may require conversion for programmatic analysis. The dataset appears to be synthetic, containing model outputs and human evaluations rather than raw patient data.
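One possible way to do that conversion is sketched below, assuming pandas and openpyxl are installed. It writes each worksheet to its own CSV file next to the workbook. The function names and the output naming scheme are illustrative choices, and the workbook path must be supplied by the user, since the full filename is truncated in the description above.

```python
import re
from pathlib import Path


def sheet_csv_path(workbook_path, sheet_name):
    """Build a CSV path beside the workbook for one worksheet.

    The "<workbook>__<sheet>.csv" naming scheme is an illustrative
    choice, not something defined by the dataset.
    """
    safe = re.sub(r"[^A-Za-z0-9_-]+", "_", sheet_name).strip("_") or "sheet"
    workbook = Path(workbook_path)
    return workbook.with_name(f"{workbook.stem}__{safe}.csv")


def excel_to_csv(workbook_path):
    """Export every worksheet of an .xlsx file to CSV and return the paths."""
    import pandas as pd  # assumed installed, with openpyxl for .xlsx support

    # sheet_name=None loads all worksheets as a dict of name -> DataFrame.
    sheets = pd.read_excel(workbook_path, sheet_name=None)
    written = []
    for name, frame in sheets.items():
        out_path = sheet_csv_path(workbook_path, name)
        frame.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

Calling `excel_to_csv` with the path to the downloaded spreadsheet returns the list of CSV files it wrote. Merged cells or multi-row headers, if the workbook has any, would still need manual cleanup after export.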