Name: 900K Judgements: Large-Scale LLM-as-a-Judge Pairwise Evaluations
Creator: ibm-research
Published: 2026-03-17T11:50:19
Keywords: Task Categories: Text Generation, Library: polars, Library: dask, Language: en, Pairwise Comparison, LLM-as-a-Judge, Text Generation, Modality: text, Size Categories: 100K<n<1M, Model Evaluation, Library: mlcroissant, Model Judgement, Library: datasets, Benchmark, LLM Evaluation, Tabular, Parquet, Region: us, Large Scale, Task Categories: Text Classification

Description

IBM Research's 900K Judgements dataset contains approximately 900,000 pairwise comparison judgements from multiple LLM judges evaluating model responses. The data was collected for the paper 'Mediocrity is the key for LLM as a Judge Anchor Selection' to investigate anchor selection in LLM-as-a-judge evaluation. The dataset was last updated on March 18, 2026.

Use Cases

Benchmarking LLM evaluation consistency based on pairwise comparison data.
Studying the impact of anchor selection on LLM-as-a-judge outcomes as described in the associated paper.
Training or calibrating automated evaluation models using a large corpus of LLM judgements.
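
As a rough illustration of the consistency use case above, the following Python sketch computes how often each pair of judges picks the same winner on comparisons they both rated. The column names comparison_id, judge, and winner are hypothetical, since the dataset's actual schema is undocumented; they would need to be mapped to the real fields after download.

```python
from collections import defaultdict
from itertools import combinations


def judge_agreement(rows):
    """Return {(judge_a, judge_b): share of shared comparisons with the same winner}.

    Assumes each row is a dict with the hypothetical keys
    'comparison_id', 'judge', and 'winner'.
    """
    # Group verdicts by comparison: verdicts[comparison_id][judge] = winner
    verdicts = defaultdict(dict)
    for row in rows:
        verdicts[row["comparison_id"]][row["judge"]] = row["winner"]

    agree = defaultdict(int)
    total = defaultdict(int)
    for by_judge in verdicts.values():
        # Only judges that rated the same comparison are compared.
        for a, b in combinations(sorted(by_judge), 2):
            total[(a, b)] += 1
            agree[(a, b)] += by_judge[a] == by_judge[b]

    return {pair: agree[pair] / total[pair] for pair in total}


# Toy rows for illustration only; these are not taken from the dataset.
rows = [
    {"comparison_id": 1, "judge": "judge_x", "winner": "A"},
    {"comparison_id": 1, "judge": "judge_y", "winner": "A"},
    {"comparison_id": 2, "judge": "judge_x", "winner": "B"},
    {"comparison_id": 2, "judge": "judge_y", "winner": "A"},
]
print(judge_agreement(rows))
```

Raw agreement does not correct for chance; a chance-corrected statistic such as Cohen's kappa may be more appropriate when verdicts are imbalanced.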

Strengths

Approximately 900,000 pairwise comparison judgements provide a substantial sample for analysis.
Data is associated with a specific, peer-reviewed research paper investigating anchor selection.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Exact row count is not stated beyond the approximate figure of 900,000, which may limit suitability assessment.
Last updated 2026-03-18 09:48:30; freshness should be verified.
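
Because column semantics and the exact row count have to be established after download, a first inspection pass might look like the sketch below. It assumes the rows have already been loaded into a list of dicts (for example from the Parquet files); the helper name summarize_columns and the sample field names are hypothetical and are not part of the dataset.

```python
from collections import Counter, defaultdict


def summarize_columns(records):
    """Return {column: {python_type_name: count}} for a list of dict records."""
    type_counts = defaultdict(Counter)
    for record in records:
        for column, value in record.items():
            # None values show up as 'NoneType', which exposes missing data.
            type_counts[column][type(value).__name__] += 1
    return {column: dict(counts) for column, counts in type_counts.items()}


# Toy records for illustration only; the real field names are undocumented.
sample = [
    {"judge": "judge_x", "score": 1, "note": None},
    {"judge": "judge_y", "score": 0, "note": "tie-break"},
]
print(len(sample))                # row count
print(summarize_columns(sample))  # observed value types per column
```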

Provenance

Source: IBM Research
Collection Method: Collected from multiple LLM judges performing pairwise evaluations of model responses.
Time Range: Not specified
Freshness: Last updated 2026-03-18 09:48:30.
Geography: Not specified

License is unknown; terms of use must be verified before application.
