LLM Performance Data for Systematic Review Extraction
by Takehiko Oami · Updated 1mo ago
1.1 MB · 1 file
Available on 1 platform
Description
This dataset is a single 1.1 MB document comparing ChatGPT-4o, Claude 3 Sonnet, and Gemini 1.5 Pro on data extraction from PDFs of sepsis trials. The study was authored by Takehiko Oami and uploaded to figshare on April 29, 2026. Mean no-error proportions ranged from 81.6% to 92.4% for background data extraction and were lower for outcome extraction, ranging from 27.8% to 80.7%.
Use Cases
Benchmarking LLM accuracy for clinical data extraction based on reported no-error proportions
Evaluating prompt engineering strategies like chain-of-thought and self-reflection
Analyzing inter-session consistency of LLM outputs across three sessions
Comparing processing times between standard and self-reflection prompts
Strengths
Performance metrics are provided for three specific LLMs (ChatGPT-4o, Claude 3 Sonnet, Gemini 1.5 Pro)
Results include processing times per article, ranging from 19.3 to 107.1 seconds
The study evaluates five specific clinical questions from the J-SSCG 2024 guidelines
Limitations
Column-level documentation is absent; field semantics must be inferred after download
The number of records is not stated, which makes it hard to judge suitability before download
The dataset is a 1.1 MB DOCX file; the underlying data format and structure are not described
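Because the document's internal structure is undocumented, a first step after download is to check whether it contains any tables. The sketch below uses only the Python standard library and assumes that the results are stored as ordinary Word tables, which this listing does not confirm. The function name `extract_docx_tables` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml of a DOCX file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def extract_docx_tables(source):
    """Return every table in a DOCX as a list of rows of cell strings.

    `source` may be a file path or a binary file-like object.
    Assumption: the data of interest sits in Word tables. Nested tables
    are also returned as separate entries, and their text is repeated
    inside the enclosing cell.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(f"{{{W_NS}}}tbl"):
        rows = []
        for tr in tbl.findall("w:tr", NS):
            cells = []
            for tc in tr.findall("w:tc", NS):
                # Concatenate all text runs found in the cell.
                text = "".join(t.text or "" for t in tc.iter(f"{{{W_NS}}}t"))
                cells.append(text)
            rows.append(cells)
        tables.append(rows)
    return tables
```

Calling `extract_docx_tables` on the downloaded file and printing the first row of each table gives a quick view of the column headers, if there are any. An empty result would suggest the figures are embedded in running text instead.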
Provenance
Source
figshare
Collection Method
LLMs extracted predefined characteristics and outcomes from PDFs of eligible studies, with outputs assessed against a manual extraction reference standard.
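The comparison against a reference standard can be illustrated with a small scoring sketch. It assumes that a no-error proportion is the share of predefined items whose extracted value matches the manual reference after trimming whitespace and ignoring case. The listing does not give the study's exact definition, so the matching rule, the field names, and the function names here are all hypothetical.

```python
from statistics import mean


def _normalize(value):
    """Illustrative matching rule: trim whitespace and ignore case."""
    return "" if value is None else str(value).strip().lower()


def no_error_proportion(extracted, reference):
    """Share of reference items that the extraction reproduced exactly.

    Both arguments are dicts mapping an item name to its value.
    Items missing from `extracted` count as errors.
    """
    if not reference:
        raise ValueError("reference must contain at least one item")
    matches = sum(
        1
        for item, expected in reference.items()
        if _normalize(extracted.get(item)) == _normalize(expected)
    )
    return matches / len(reference)


def mean_no_error_proportion(sessions, reference):
    """Average the proportion over repeated sessions for one article."""
    return mean(no_error_proportion(s, reference) for s in sessions)


# Made-up values for illustration only, not taken from the dataset.
reference = {"sample_size": "120", "country": "Japan", "mortality": "0.25"}
session_output = {"sample_size": "120", "country": " japan ", "mortality": "0.30"}
score = no_error_proportion(session_output, reference)  # 2 of 3 items match
```

A stricter or looser matching rule, for example numeric tolerance on outcome values, would change the resulting proportions, so any reanalysis should state the rule it uses.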
Freshness
Last updated 2026-04-29 05:52:37
License
CC-BY-4.0. The primary data is contained within a DOCX document.