Name: RankJudge: Benchmark for Evaluating LLM Judges on Multi-Turn Conversations
Creator: Layer6
Published: 2026-05-20T19:59:47
Keywords: Text Pairs, Benchmark, LLM Evaluation, Healthcare, Text, Conversation Quality, AI Judges

Description

RankJudge is a benchmark dataset for evaluating Large Language Model (LLM) judges on multi-turn conversation quality. It contains 652 pairs of conversations, each pair consisting of one good conversation and one with a single injected weakness, grounded in source documents such as computer science papers, medical papers, or 10-K filings. The dataset was created by Layer6 and was last updated on May 20, 2026.

Use Cases

Benchmarking LLM judge performance on the task of identifying the 'bad' conversation in each pair.
Training or fine-tuning models for conversation quality assessment using the provided ground-truth verdicts.
Analyzing failure modes of LLM judges across different weakness types injected into conversations.
Researching the alignment of LLM judgments with human preferences in multi-turn dialogue grounded in technical documents.
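
The benchmarking and failure-mode use cases above can be sketched as a small scoring loop. This is a minimal illustration only: the field names (`conversation_a`, `conversation_b`, `bad`, `weakness_type`) and the toy judge are assumptions made for the example, since the dataset's actual columns are not documented here.

```python
from collections import defaultdict


def judge_accuracy(pairs, judge):
    """Score a judge on conversation pairs with known ground truth.

    Hypothetical schema: each pair is a dict with keys
    'conversation_a', 'conversation_b', 'bad' ("a" or "b"), and
    'weakness_type'. `judge(conv_a, conv_b)` returns "a" or "b" for the
    conversation it considers worse.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pair in pairs:
        verdict = judge(pair["conversation_a"], pair["conversation_b"])
        totals[pair["weakness_type"]] += 1
        if verdict == pair["bad"]:
            hits[pair["weakness_type"]] += 1
    n = sum(totals.values())
    overall = sum(hits.values()) / n if n else 0.0
    by_type = {t: hits[t] / totals[t] for t in totals}
    return overall, by_type


def shorter_is_worse(conv_a, conv_b):
    """Toy stand-in for an LLM judge: flags the shorter conversation."""
    return "a" if len(conv_a) < len(conv_b) else "b"


# Made-up pairs for illustration; not taken from RankJudge.
toy_pairs = [
    {"conversation_a": ["Q", "A"], "conversation_b": ["Q"],
     "bad": "b", "weakness_type": "incomplete"},
    {"conversation_a": ["Q"], "conversation_b": ["Q", "A"],
     "bad": "a", "weakness_type": "incomplete"},
    {"conversation_a": ["Q", "wrong A"], "conversation_b": ["Q", "A"],
     "bad": "a", "weakness_type": "factual_error"},
]

overall, by_type = judge_accuracy(toy_pairs, shorter_is_worse)
print(f"overall accuracy: {overall:.2f}")
for weakness, acc in sorted(by_type.items()):
    print(f"{weakness}: {acc:.2f}")
```

Breaking accuracy down by weakness type shows where a judge fails rather than only how often. In practice the judge callable would wrap an LLM call, and presenting each pair in both orders would help separate real errors from position bias.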

Strengths

Provides a structured benchmark with 652 conversation pairs and known ground-truth verdicts.
Conversations are grounded in specific source document types (CS papers, medical papers, 10-K filings), adding contextual realism.
Includes a 'matches' subset with 13,692 rows, likely representing individual LLM judge predictions for detailed analysis.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The row count of the primary subset is not stated, although the description reports 652 conversation pairs; this may limit suitability assessment.
Data may reflect bias inherent to the specific source documents and conversation construction methods used.
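
Because column-level documentation is absent, a first step after download is usually to inspect the rows directly. The sketch below summarizes column names, observed value types, and a few example values from a list of row dictionaries. The sample rows and their field names are hypothetical placeholders, not the dataset's real schema.

```python
def summarize_columns(rows, max_examples=2):
    """Collect observed types and a few distinct example values per column."""
    summary = {}
    for row in rows:
        for name, value in row.items():
            info = summary.setdefault(name, {"types": set(), "examples": []})
            info["types"].add(type(value).__name__)
            # Keep only a small number of distinct examples per column.
            if value not in info["examples"] and len(info["examples"]) < max_examples:
                info["examples"].append(value)
    return summary


# Hypothetical rows standing in for downloaded records.
sample_rows = [
    {"pair_id": 1, "source_type": "10-K", "verdict": "a"},
    {"pair_id": 2, "source_type": "medical", "verdict": None},
    {"pair_id": 3, "source_type": "10-K", "verdict": "b"},
]

schema = summarize_columns(sample_rows)
for column, info in schema.items():
    print(column, sorted(info["types"]), info["examples"])
```

Columns that show more than one type (here, `verdict` mixing strings and missing values) are worth checking by hand before relying on them as ground truth.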

Provenance

Source: Layer6 via Hugging Face.
Collection Method: Conversation pairs were constructed by injecting a single weakness into a 'good' conversation, with ground truth known.
Freshness: Last updated 2026-05-20 20:06:42; freshness should be verified.

License information is unknown and should be verified before use.
