Available on 1 platform
IBM Research's 900K Judgements dataset contains approximately 900,000 pairwise comparison judgements in which multiple LLM judges evaluate model responses. The data was collected for the paper 'Mediocrity is the key for LLM as a Judge Anchor Selection' to investigate anchor selection in LLM-as-a-judge evaluation. The dataset was last updated on March 18, 2026.
The license is unknown; verify the terms of use before using the dataset.