Available on 1 platform
RankJudge is a benchmark dataset for evaluating large language model (LLM) judges on multi-turn conversation quality. It contains 652 pairs of conversations grounded in source documents such as computer science papers, medical papers, or 10-K filings. Each pair consists of one good conversation and one with a single injected weakness. The dataset was created by Layer6 and was last updated on May 20, 2026.
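The pair structure described above suggests one simple way a judge could be scored: check how often it rates the good conversation above the weakened one. The following is a minimal sketch under stated assumptions, not the dataset's actual schema or evaluation protocol. The names `ConversationPair`, `source_document`, `good`, `weakened`, `Judge`, and `pairwise_accuracy` are all hypothetical, and counting ties as misses is a design choice made here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical record layout; the real dataset's field names may differ.
@dataclass(frozen=True)
class ConversationPair:
    source_document: str   # text the conversations are grounded in
    good: List[str]        # turns of the good conversation
    weakened: List[str]    # turns of the conversation with one injected weakness

# Assumed judge interface: (source document, conversation turns) -> quality score.
Judge = Callable[[str, List[str]], float]

def pairwise_accuracy(pairs: Iterable[ConversationPair], judge: Judge) -> float:
    """Fraction of pairs where the judge scores the good conversation strictly higher.

    Ties are counted as misses (an assumption of this sketch).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = sum(
        judge(p.source_document, p.good) > judge(p.source_document, p.weakened)
        for p in pairs
    )
    return correct / len(pairs)
```

In practice `judge` would wrap a call to an LLM. A rule-based stand-in with the same signature is enough to exercise the function on a handful of hand-written pairs before running a real model.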
License information is unknown and should be verified before use.