Name: Human-Annotated Machine Translation Ranking Benchmark
Creator: double7
Published: 2026-02-26T13:09:20
Keywords: size category 1K<n<10K, arXiv:2602.14028, library: polars, language: zh, language: en, modality: text, library: mlcroissant, library: datasets, library: pandas, Parquet, region: us, language: de

Description

A test-only benchmark for evaluating how well machine translation reward models and metrics rank candidate translations within a group. Each entry pairs a source sentence with a small group of candidate translations, each with an associated human quality score. The dataset was published by double7 and last updated in February 2026.

Use Cases

Evaluate reward models on their ability to rank candidate translations within a group using human quality scores.
Benchmark machine translation metrics against human judgments for ranking tasks as studied in the GRRM paper.
Test the intra-group ranking performance of models on source sentences with 2-4 candidate translations.
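
As a rough illustration of intra-group ranking evaluation, the sketch below computes pairwise ranking accuracy: the fraction of candidate pairs within each group that a model orders the same way as the human scores. This is one common way to score such a task and is an assumption here, not necessarily the protocol used in the GRRM paper. The function name, the input layout, and the example scores are all hypothetical and are not taken from the dataset.

```python
from itertools import combinations


def pairwise_ranking_accuracy(groups):
    """Fraction of within-group candidate pairs ordered as the humans ordered them.

    `groups` is an iterable of (human_scores, model_scores) tuples, one per
    source sentence, where both lists are aligned by candidate translation.
    This layout is hypothetical; the real column structure is not documented.
    """
    correct = 0
    total = 0
    for human, model in groups:
        if len(human) != len(model):
            raise ValueError("human and model scores must align per candidate")
        for i, j in combinations(range(len(human)), 2):
            if human[i] == human[j]:
                continue  # pairs tied in the human scores carry no ordering
            total += 1
            # Same sign of the score difference means the same ordering.
            # A model tie on a non-tied human pair counts as incorrect.
            if (human[i] - human[j]) * (model[i] - model[j]) > 0:
                correct += 1
    return correct / total if total else float("nan")


# Made-up scores for two groups of 3 and 2 candidates (illustration only).
example_groups = [
    ([90.0, 60.0, 75.0], [0.8, 0.1, 0.5]),  # model agrees on all 3 pairs
    ([50.0, 70.0], [0.9, 0.2]),             # model inverts the only pair
]
accuracy = pairwise_ranking_accuracy(example_groups)
print(accuracy)  # 3 of 4 pairs agree -> 0.75
```

Only pairs from the same group are compared, so a model is never penalized for scoring translations of different source sentences on different scales.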

Strengths

Human-derived quality scores provide a direct benchmark for model evaluation.
Specifically designed for evaluating intra-group ranking ability, a focused research task.
Last updated in February 2026, indicating recent maintenance.

Limitations

The exact row count and column structure are not documented here (the size tag indicates only 1K to 10K rows), which limits assessment of statistical power.
The dataset is described as test-only, so it cannot be used for model training.
The language tags list zh, en, and de, but translation directions, translation domains, and annotator demographics are not specified.

Provenance

Source: huggingface, author double7
Collection Method: Human-annotated benchmark collection.
Freshness: Last updated 2026-02-27.

The full description is available on the external dataset page.
