RationaleRM: 10K-100K Samples for Aligning Reward Model Reasoning

Name: RationaleRM: 10K-100K Samples for Aligning Reward Model Reasoning
Creator: Qwen
Published: 2026-02-02T14:16:10
Keywords: size_categories:10K<n<100K, task_categories:question-answering, task_categories:text-classification, language:en, license:cc-by-4.0, region:us, arxiv:2602.04649, llm-as-judge, metajudge, rationale-consistency, reward-model

Description

RationaleRM, released by Qwen in February 2026, provides between 10,000 and 100,000 records for aligning the reasoning of reward models with human judgments. The dataset centers on rationale consistency as a signal for distinguishing frontier models and detecting deceptive alignment in text-classification and question-answering tasks. It serves as a benchmark for evaluating whether a model's stated reasoning matches its final output.

Use Cases

Training reward models to prioritize rationale-consistency over simple outcome accuracy
Implementing LLM-as-judge workflows to evaluate the logical steps of model responses
Researching deceptive alignment by analyzing discrepancies between model rationales and final outputs
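
The difference between outcome accuracy and rationale consistency in the use cases above can be sketched in a few lines of Python. This is only an illustration: the record fields (gold_preference, judge_preference, rationale_agrees) are hypothetical, since the dataset's actual columns are not documented here, and the scoring rule is one simple way to operationalize the idea rather than the definition used in the paper.

```python
from dataclasses import dataclass


@dataclass
class JudgedSample:
    # Hypothetical fields, not the dataset's real schema.
    gold_preference: str    # response preferred by human annotators, e.g. "A" or "B"
    judge_preference: str   # response preferred by the judge / reward model
    rationale_agrees: bool  # True if the judge's rationale matches the human rationale


def outcome_accuracy(samples):
    """Fraction of samples where the judge picked the human-preferred response."""
    if not samples:
        return 0.0
    hits = sum(s.judge_preference == s.gold_preference for s in samples)
    return hits / len(samples)


def rationale_consistent_accuracy(samples):
    """Fraction of samples where the verdict is right AND the rationale agrees."""
    if not samples:
        return 0.0
    hits = sum(
        s.judge_preference == s.gold_preference and s.rationale_agrees
        for s in samples
    )
    return hits / len(samples)


# Toy records invented for illustration only.
samples = [
    JudgedSample("A", "A", True),   # right verdict, matching rationale
    JudgedSample("B", "B", False),  # right verdict, mismatched rationale
    JudgedSample("A", "B", False),  # wrong verdict
    JudgedSample("B", "B", True),   # right verdict, matching rationale
]

print(outcome_accuracy(samples))
print(rationale_consistent_accuracy(samples))
```

A gap between the two numbers flags samples where the judge reaches the right verdict for reasons that do not match the human rationale, which is the kind of discrepancy the use cases above are concerned with.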

Strengths

Scale of 10,000 to 100,000 records
Includes rationale-consistency metrics for detecting deceptive alignment
Linked to arXiv paper 2602.04649 for theoretical grounding

Limitations

Limited to English language tasks
Geographic bias toward US-centric data as indicated by region tags
Lack of detailed column-level documentation in the metadata

Provenance

Source: Qwen (arXiv 2602.04649)
Freshness: Last updated February 2026.
Geography: United States

Consult arXiv paper 2602.04649 for the definitions and methodology behind rationale consistency. The dataset is licensed under CC BY 4.0.

Quality Score

Overall: D37
Description: 36
Source: 36
Reputation: 52
Access: 22

Community

Downloads: 711
Likes: 24
Views: 0

Dataset Info

Author: Qwen
Created: Feb 2, 2026
Updated: Feb 5, 2026
Last synced: Jul 3, 2026
