This dataset evaluates the accuracy of nine large language models on structured electronic health record administrative tasks. It contains results from 32,950 model queries across 25 table-size combinations, covering three strategies: direct prompting, chain-of-thought reasoning, and tool-enabled code generation. The dataset was authored by Eyal Klang and last updated on May 7, 2026.
It is licensed under CC-BY-4.0, which requires attribution.