Five large language models were evaluated on 15 high-fidelity cardiac surgery scenarios by senior surgeons. O1 achieved the highest median normalized score (0.896), while patient safety and hallucination avoidance were the lowest-scoring dimensions across models. The dataset, authored by Marc Leon and last updated in May 2026, documents the evaluation framework and results, concluding that LLMs are not yet ready for safe use in complex surgical settings.
Data is provided as a DOCX document; users will need to parse the text to extract structured information.
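Because the data ships as a DOCX file, a first parsing step might look like the sketch below. It relies only on the fact that a DOCX file is a ZIP archive whose main text lives in `word/document.xml`. The function name `docx_paragraphs` is hypothetical, and nothing here reflects the dataset's actual internal layout, which would need to be inspected before extracting scores or scenario text.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return non-empty paragraph texts from a DOCX file.

    `source` may be a filesystem path or a binary file-like object.
    Hypothetical helper for illustration; not part of the dataset.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text is split across <w:t> runs.
    for p in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")).strip()
        if text:
            paragraphs.append(text)
    return paragraphs
```

Paragraphs inside table cells are returned in document order along with body paragraphs, so any table of per-model scores would still need to be regrouped into rows after extraction.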