Name: LLM Performance on EHR Tabular Tasks Across 32,950 Queries
Creator: Eyal Klang
Published: 2026-05-07T17:28:11
License: CC-BY-4.0
Keywords: Benchmarking, LLM Evaluation, Healthcare, Tabular Data, Tabular, Excel, Clinical EHR, Synthetic

Description

This dataset reports results from 32,950 model queries used to evaluate nine large language models (LLMs) on structured electronic health record (EHR) tasks. Authored by Eyal Klang and last updated in May 2026, it comes from a study that sampled 50,000 emergency department visits to test three prompting strategies: direct, chain-of-thought (CoT), and tool-based code generation.

Use Cases

Benchmark LLM accuracy on structured EHR tasks using the reported evaluation of nine models.
Compare prompting strategies (direct, CoT, tool-based code generation) for tabular data extraction.
Analyze how table size (5 to 25 rows and columns) affects LLM performance across the tested combinations.
Investigate JSON format compliance and execution errors in LLM outputs for code generation tasks.
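
The JSON compliance use case can be illustrated with a minimal Python sketch. The card does not describe the response schema used in the study, so the required key `answer`, the category labels, and the sample outputs below are all hypothetical and chosen only for illustration.

```python
import json
from collections import Counter


def classify_output(raw, required_keys=("answer",)):
    """Classify one raw model output by JSON compliance.

    The required key "answer" is a hypothetical schema, not the study's.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"      # not parseable as JSON at all
    if not isinstance(parsed, dict):
        return "not_object"        # valid JSON, but not a JSON object
    if any(key not in parsed for key in required_keys):
        return "missing_keys"      # object lacks an expected field
    return "ok"


def compliance_summary(outputs):
    """Count how many outputs fall into each compliance category."""
    return Counter(classify_output(raw) for raw in outputs)


# Made-up outputs for illustration; these are not taken from the dataset.
sample_outputs = [
    '{"answer": 42}',
    "The answer is 42",
    "[42]",
    '{"result": 42}',
]
print(dict(compliance_summary(sample_outputs)))
```

Separating parse failures from schema failures keeps the two error sources distinguishable when comparing prompting strategies.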

Strengths

Results are derived from a real-world sample of 50,000 emergency department visits.
Evaluation covers 32,950 model queries across 25 table size combinations.
Study compares nine LLMs and three distinct prompting strategies.
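
A comparison across models and prompting strategies could be sketched as a grouped accuracy calculation. The field names `model`, `strategy`, and `correct`, and the records themselves, are assumptions made for this example; the actual column names in the file are undocumented and may differ, and the file may hold pre-aggregated summaries rather than per-query rows.

```python
from collections import defaultdict


def accuracy_by_group(records, keys=("model", "strategy")):
    """Return the fraction of correct answers per group.

    `records` is an iterable of dicts; the field names used here are
    hypothetical and not taken from the dataset.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group][0] += int(rec["correct"])
        totals[group][1] += 1
    return {group: correct / total for group, (correct, total) in totals.items()}


# Made-up per-query records for illustration only.
records = [
    {"model": "model_a", "strategy": "direct", "correct": True},
    {"model": "model_a", "strategy": "direct", "correct": False},
    {"model": "model_a", "strategy": "tool", "correct": True},
    {"model": "model_b", "strategy": "cot", "correct": False},
]

for group, acc in sorted(accuracy_by_group(records).items()):
    print(group, acc)
```

Passing a different `keys` tuple, for example one that includes a table-size field, would give the same breakdown along another dimension.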

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The row count of the underlying results file is unknown, which makes suitability hard to assess before download.
The file is very small (5.5 KB), so it likely contains summary results rather than raw per-query data.

Provenance

Source: figshare
Collection Method: Random sampling from a real-world dataset of emergency department visits, with model outputs validated against references.
Time Range: Not specified
Freshness: Last updated 2026-05-07 17:28:11; check the source for newer versions.
Geography: Not specified

License is CC-BY-4.0, requiring attribution. File format is XLS, requiring compatible software.
