MedMisBench evaluates whether large language models preserve correct medical judgment when misleading context is introduced. The benchmark is built from five medical question-answering sources spanning standard medical reasoning, expert reasoning, patient-journey scenarios, and agentic biomedical capability. It was created by HongjianZhou and last updated on June 15, 2026.
The license is unknown, so verify the terms of use before using the dataset.