A test-only benchmark for evaluating how well machine translation reward models and metrics rank candidates within a group. It contains source sentences, each paired with a small group of candidate translations that carry human quality scores. The dataset was created by author double7 and last updated in February 2026.
Because this is a test-only benchmark, it cannot be used for training models. The full description is available on an external dataset page.
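To make the idea of intra-group ranking concrete, the following is a minimal sketch of one way such an evaluation could be scored: for each group, count how often a metric orders a pair of candidates the same way the human scores do, then average over groups. The listing does not state the dataset's schema or its official scoring protocol, so the field names (`human_scores`, `metric_scores`), the toy data, and the choice of pairwise accuracy are all assumptions made for illustration.

```python
from itertools import combinations


def group_pairwise_accuracy(human, predicted):
    """Share of candidate pairs in one group that both score lists order the same way.

    Pairs tied in the human scores are skipped. Returns None when the group
    has no comparable pairs.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(human)), 2):
        human_diff = human[i] - human[j]
        if human_diff == 0:
            continue  # no human preference for this pair
        total += 1
        if human_diff * (predicted[i] - predicted[j]) > 0:
            agree += 1
    return agree / total if total else None


def mean_group_accuracy(groups):
    """Average the per-group pairwise accuracy over all groups that have one."""
    per_group = [
        group_pairwise_accuracy(g["human_scores"], g["metric_scores"])
        for g in groups
    ]
    per_group = [s for s in per_group if s is not None]
    return sum(per_group) / len(per_group) if per_group else None


# Toy data with hypothetical field names; not taken from the benchmark.
groups = [
    {"source": "toy sentence A",
     "human_scores": [90.0, 60.0, 75.0],
     "metric_scores": [0.8, 0.3, 0.5]},
    {"source": "toy sentence B",
     "human_scores": [40.0, 55.0, 55.0],
     "metric_scores": [0.6, 0.4, 0.7]},
]

print(mean_group_accuracy(groups))
```

Scoring each group separately keeps the comparison among translations of the same source sentence, which is what distinguishes intra-group ranking from a single corpus-wide correlation.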