Name: QRPO Paper: Llama SFT LeetCode Sandbox Reference Completions and Rewards
Creator: skandermoalla
Published: 2025-12-08T10:24:54
Keywords: Off Policy, Alignment, Tabular, Code Generation, Reinforcement Learning, Large Language Models

Description

A dataset hosted on Hugging Face, created by skandermoalla and last updated on December 8, 2025. As the title indicates, it contains reference completions from a Llama SFT model together with their rewards from a LeetCode sandbox, and it is intended for training with the QRPO reference codebase. The dataset belongs to the collection supporting the paper 'Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions'.

Use Cases

Training code-generation policies from reference completions and their reward signals.
Benchmarking off-policy optimization algorithms against the provided rewards.
Reproducing experiments from the QRPO research paper using the linked dataset collection.
Fine-tuning language models on programming tasks using structured reward feedback.
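
As a rough illustration of how per-prompt reference rewards might be used, the sketch below maps a raw reward to its empirical quantile among the rewards of reference completions for the same prompt. It is only a sketch under stated assumptions: the function name `empirical_quantile` and the sample numbers are hypothetical, and whether the QRPO codebase computes its quantile reward in exactly this way is not confirmed by this card.

```python
from bisect import bisect_right


def empirical_quantile(reward, reference_rewards):
    """Fraction of reference rewards that are <= the given reward.

    Assumption: `reference_rewards` holds the rewards of reference
    completions for one prompt. This helper is illustrative and is
    not taken from the QRPO codebase.
    """
    if not reference_rewards:
        raise ValueError("reference_rewards must not be empty")
    ordered = sorted(reference_rewards)
    # bisect_right counts the elements <= reward in the sorted list.
    return bisect_right(ordered, reward) / len(ordered)


# Hypothetical rewards for four reference completions of one prompt.
reference = [0.0, 0.5, 1.0, 1.0]
quantile = empirical_quantile(0.5, reference)
```

With these hypothetical values, two of the four reference rewards are at or below 0.5, so the helper returns 0.5. A quantile signal of this kind lies in [0, 1] whatever the scale of the raw reward.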

Strengths

Dataset is explicitly linked to a published research paper (arXiv:2507.08068).
Includes both reference completions and their rewards, the key inputs for offline and off-policy alignment workflows such as QRPO.
Last update timestamp (2025-12-08 13:27:14) is provided.

Limitations

Column names, data types, and sample rows are unknown, requiring inspection after download.
The total number of rows, file formats, and license information are not specified.
Data may reflect bias inherent to the specific model, reward model, and sandbox environment used for generation.
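
Because the schema is not documented here, a first step after download is to check which columns exist and what types they hold. The sketch below does this for any iterable of dict-like rows. The helper name `summarize_schema` and the sample rows (with columns `prompt`, `completions`, `rewards`) are hypothetical placeholders, not the dataset's actual columns.

```python
from collections import defaultdict


def summarize_schema(rows, max_rows=100):
    """Map each column name to the sorted Python type names seen in it.

    Only the first `max_rows` rows are inspected, so this stays cheap
    on large datasets.
    """
    seen = defaultdict(set)
    for index, row in enumerate(rows):
        if index >= max_rows:
            break
        for column, value in row.items():
            seen[column].add(type(value).__name__)
    return {column: sorted(types) for column, types in seen.items()}


# Hypothetical rows standing in for downloaded records; the real
# column names are unknown and must be read from the data itself.
sample_rows = [
    {"prompt": "Two Sum", "completions": ["def solve(): ..."], "rewards": [0.5]},
    {"prompt": "Add Two Numbers", "completions": [], "rewards": None},
]
schema = summarize_schema(sample_rows)
```

The same function can be run over rows loaded with whatever tool is used to fetch the dataset, as long as each row behaves like a dictionary. A column that reports more than one type (here `rewards`) is a hint that missing values need handling before training.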

Provenance

Source: Hugging Face
Collection Method: Likely generated by sampling completions from a Llama SFT model and scoring them in a LeetCode sandbox environment, as part of the research for the QRPO paper.
Time Range: Not specified
Freshness: Last updated 2025-12-08 13:27:14.
Geography: Not specified

Designed for use with the specific QRPO reference codebase (github.com/CLAIRE-Labo/quantile-reward-policy-optimization); compatibility with other frameworks is unknown.
