Name: Evaluating LLMs for Abstract Evaluation: An Empirical Study
Creator: Yinuo Liu
Published: 2026-05-04T05:33:56
License: CC-BY-4.0
Keywords: Benchmark, Peer Review, AI Assessment, Tabular, Empirical Study, Large Language Models, Abstract Evaluation

Description

A 2026 study by Yinuo Liu compares three large language models (ChatGPT-5, Gemini-3-Pro, Claude-Sonnet-4.5) with human reviewers in scoring 160 conference abstracts. The research assesses inter-rater reliability with intraclass correlation coefficients and systematic bias with Bland-Altman plots. The dataset, shared under a CC-BY-4.0 license, consists of a single 211.4 KB document containing the results of this analysis.

Use Cases

Benchmarking LLM performance against human reviewers based on scoring patterns and reliability metrics described in the study.
Investigating systematic bias in AI-assisted abstract evaluation using the Bland-Altman analysis results.
Informing the training or fine-tuning of models for academic content assessment, using the study's rubric of eight criteria scored on a 1–5 scale as a design reference.
Analyzing the consistency of different LLMs (ChatGPT, Gemini, Claude) on objective versus subjective evaluation criteria as outlined in the results.
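
For readers who obtain paired LLM and human scores, the Bland-Altman quantities mentioned above (mean difference, or bias, and 95% limits of agreement) can be sketched as follows. This is a minimal illustration, not the study's own procedure: the function name `bland_altman`, the conventional 1.96 multiplier, and the sample ratings are all assumptions introduced here.

```python
from statistics import mean, stdev


def bland_altman(llm_scores, human_scores):
    """Return (bias, lower_limit, upper_limit) for paired ratings.

    Hypothetical helper: bias is the mean of (LLM - human) differences,
    and the limits of agreement are bias +/- 1.96 sample standard deviations.
    """
    if len(llm_scores) != len(human_scores):
        raise ValueError("paired ratings must have equal length")
    if len(llm_scores) < 2:
        raise ValueError("at least two pairs are needed")
    diffs = [l - h for l, h in zip(llm_scores, human_scores)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Made-up ratings on the 1-5 scale, for illustration only (not study data).
llm = [4, 5, 3, 4, 5, 4]
human = [3, 4, 3, 4, 4, 3]
bias, lower, upper = bland_altman(llm, human)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

A positive bias here would mean the LLM scores higher than the human reviewers on average; wide limits of agreement indicate that individual ratings can diverge substantially even when the bias is small.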

Strengths

Includes comparative analysis of three prominent LLMs (ChatGPT-5, Gemini-3-Pro, Claude-Sonnet-4.5) against 14 human reviewers.
Based on a defined sample of 160 abstracts from a regional conference, evaluated using an eight-criteria rubric.
Provides specific statistical results, such as intraclass correlation coefficients ranging from 0.23 to 0.87 and mean differences from human ratings.
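
The listing does not say which form of the intraclass correlation coefficient the study reports. As a hedged sketch, the code below computes one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a table with one row per abstract and one column per rater. The choice of variant and the function name `icc2_1` are assumptions made for illustration.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per rated item (e.g. abstract),
    each holding one score per rater. Assumed variant; the study may
    use a different ICC form.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 items and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-item mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    denom = msr + (k - 1) * mse + k * (msc - mse) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (msr - mse) / denom


# Made-up scores: 3 abstracts rated by 2 raters (not study data).
print(icc2_1([[1, 2], [2, 3], [3, 4]]))
```

Because this variant measures absolute agreement, a rater who is consistently one point higher than another lowers the coefficient even though the two rank the abstracts identically.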

Limitations

The underlying data (abstracts, individual scores) is not directly accessible; only the analysis document is provided.
Row count and column-level documentation are unknown, which makes it hard to judge suitability for direct reuse.
The dataset's small size (211.4 KB) suggests limited scope: it likely contains the study's analysis rather than raw evaluation data.

Provenance

Source: Yinuo Liu via figshare
Collection Method: Empirical study comparing LLM and human evaluations of conference abstracts.
Freshness: Last updated 2026-05-04 05:33:56

Primary data is embedded within a DOCX analysis document; raw tabular data is not provided in a separate, machine-readable format.
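
Since the results live inside a DOCX file, one way to recover any tables it contains is to read the document's XML directly with the Python standard library. The sketch below assumes the file follows the usual WordprocessingML layout (tables as `w:tbl` elements in `word/document.xml`); the function names and the file name `analysis.docx` are hypothetical, and whether the document actually holds tables is not confirmed by this listing.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def tables_from_document_xml(xml_bytes):
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_bytes)
    tables = []
    for tbl in root.iter(W + "tbl"):  # note: also visits nested tables
        rows = []
        for tr in tbl.findall(W + "tr"):
            cells = []
            for tc in tr.findall(W + "tc"):
                # Join all text runs inside the cell into one string.
                cells.append("".join(t.text or "" for t in tc.iter(W + "t")))
            rows.append(cells)
        tables.append(rows)
    return tables


def tables_from_docx(path):
    """Open a .docx (a ZIP archive) and extract tables from its main part."""
    with zipfile.ZipFile(path) as zf:
        return tables_from_document_xml(zf.read("word/document.xml"))


# Example usage with a hypothetical local file name:
# for table in tables_from_docx("analysis.docx"):
#     for row in table:
#         print(row)
```

Cell values come back as strings, so numeric columns such as scores or coefficients would still need to be parsed and validated before analysis.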
