Name: LLM Accuracy on EHR Tabular Tasks Across 32,950 Model Queries
Creator: Eyal Klang
Published: 2026-05-07T17:28:14
License: CC-BY-4.0
Keywords: Tabular Data Extraction, LLM Evaluation, Healthcare, Tabular, Prompting Strategies, Excel, Clinical EHR, Synthetic

Description

A dataset evaluating the accuracy of nine large language models (LLMs) on administrative tasks over structured electronic health record (EHR) tables. It contains results from 32,950 model queries across 25 table-size combinations, using three prompting strategies: direct prompting, chain-of-thought (CoT) reasoning, and tool-enabled code generation. The dataset was authored by Eyal Klang and last updated on May 7, 2026.

Use Cases

Benchmarking LLM performance on structured EHR tasks, using the reported results for nine models.
Analyzing how prompting strategy (direct, CoT, tool-based) affects accuracy.
Studying how LLM accuracy scales with table size, across tables of 5–25 rows and columns.
Comparing formatting and execution error rates across models, based on the assessment of JSON compliance and code execution.
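The comparisons listed above mostly reduce to grouping per-query results and computing the fraction of correct answers in each group. The following is a minimal sketch of that aggregation, not the dataset's own analysis code. Because column-level documentation is absent, every field name here (`model`, `strategy`, `rows`, `cols`, `correct`) and every record is a hypothetical placeholder, and the file may hold pre-aggregated summaries rather than per-query rows.

```python
from collections import defaultdict

# Hypothetical per-query records. Field names and values are illustrative
# assumptions only; they are not taken from the dataset.
records = [
    {"model": "model_a", "strategy": "direct", "rows": 5, "cols": 5, "correct": True},
    {"model": "model_a", "strategy": "direct", "rows": 25, "cols": 25, "correct": False},
    {"model": "model_a", "strategy": "cot", "rows": 5, "cols": 5, "correct": True},
    {"model": "model_a", "strategy": "cot", "rows": 25, "cols": 25, "correct": True},
    {"model": "model_b", "strategy": "tool", "rows": 5, "cols": 5, "correct": True},
    {"model": "model_b", "strategy": "tool", "rows": 25, "cols": 25, "correct": True},
]


def accuracy_by(records, *keys):
    """Return {group: fraction correct}, where a group is a tuple of the given keys."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += 1
        hits[group] += int(rec["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Accuracy per prompting strategy, and per table size (rows x columns).
by_strategy = accuracy_by(records, "strategy")
by_size = accuracy_by(records, "rows", "cols")

for group, acc in sorted(by_strategy.items()):
    print(group, round(acc, 3))
```

Passing several keys, for example `accuracy_by(records, "model", "strategy")`, gives the per-model, per-strategy breakdown that a benchmarking comparison would need.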

Strengths

Results are based on a real-world sample of 50,000 emergency department visits.
Evaluation covers 32,950 model queries across nine LLMs and three prompting strategies.
Tasks were tested across 25 combinations of table sizes, providing scalability insights.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes it harder to assess suitability before download.
The dataset is very small (5.5 KB), indicating it likely contains summary results, not the underlying raw data.

Provenance

Source: figshare
Collection Method: Random sampling from a real-world dataset of emergency department visits, with model outputs compared to validated references.
Time Range: not specified
Freshness: Last updated 2026-05-07 17:28:14; freshness should be verified.
Geography: not specified

License is CC-BY-4.0, requiring attribution.
