Name: Paper Conclusion RL Training: Qwen3-VL-8B-Thinking with External Judge
Creator: SII-ChengqiLi
Published: 2026-04-10T10:18:23
Keywords: Academic Text, Model Training, Text, Reinforcement Learning, Large Language Models

Description

Paper Conclusion RL Training is a dataset for reinforcement learning (RL) training with the EasyR1 (verl) framework. The model being trained is Qwen3-VL-8B-Thinking. An external judge model (Qwen3-4B-Instruct-2507) scores each predicted conclusion against a reference conclusion produced by a 235B teacher model. The dataset was authored by SII-ChengqiLi and last updated on 2026-04-10.
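
The judge-based scoring described above can be illustrated with a minimal Python sketch. It is not the dataset's actual reward code: the prompt wording, the 0-10 scale, the JSON reply format, and the names `JUDGE_PROMPT`, `parse_judge_score`, `judge_reward`, and `fake_judge` are all assumptions made for illustration. The real judge call (for example, a request to a served Qwen3-4B-Instruct-2507 endpoint) is replaced by a stub.

```python
import json
import math
import re
from typing import Callable

# Hypothetical judge prompt; the dataset's real prompt is not documented here.
JUDGE_PROMPT = (
    "You are grading a generated paper conclusion against a reference.\n"
    "Reference conclusion:\n{reference}\n\n"
    "Predicted conclusion:\n{prediction}\n\n"
    'Reply with JSON only, for example {{"score": 7}}, where score is 0-10.'
)


def parse_judge_score(reply: str, max_score: float = 10.0) -> float:
    """Extract a score from the judge reply and normalise it to [0, 1].

    Returns 0.0 when no valid score can be parsed, so malformed judge
    output never produces a positive reward.
    """
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)
    if match is None:
        return 0.0
    try:
        raw = float(json.loads(match.group(0))["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0
    if not math.isfinite(raw):
        return 0.0
    return min(max(raw / max_score, 0.0), 1.0)


def judge_reward(
    prediction: str,
    reference: str,
    call_judge: Callable[[str], str],
) -> float:
    """Build the judge prompt, query the judge, and return a reward in [0, 1]."""
    prompt = JUDGE_PROMPT.format(reference=reference, prediction=prediction)
    return parse_judge_score(call_judge(prompt))


def fake_judge(prompt: str) -> str:
    """Stand-in for the external judge model; always answers with a score of 8."""
    return 'Sure. {"score": 8}'


reward = judge_reward(
    prediction="We show the method improves accuracy.",
    reference="The proposed method improves accuracy on three benchmarks.",
    call_judge=fake_judge,
)
```

Keeping the parsing step separate from the judge call makes it possible to test the reward logic without a running judge model, and clamping to [0, 1] keeps out-of-range judge replies from distorting the reward.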

Use Cases

Fine-tuning language models to generate academic paper conclusions with the described RL training framework.
Benchmarking model-generated conclusions against a teacher model's output using the described scoring methodology.
Studying variants of reinforcement learning from human feedback (RLHF) in which a judge model, rather than human raters, supplies the reward signal for improving text quality.
Developing reward models for text generation tasks, using the described judge-model setup as a reference.

Strengths

Training framework (EasyR1/verl) and configuration files are provided, offering a reproducible setup.
Uses a specific, large teacher model (235B) and a defined judge model (Qwen3-4B-Instruct-2507) for scoring.
Includes prompt templates (paper_conclusion_json.jinja) for structured input formatting.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count, file formats, and sample data are unknown, which may limit suitability assessment.
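
Because row counts and field semantics have to be inferred after download, a small inspection helper can speed up that first look. The sketch below assumes the downloaded files turn out to be JSON Lines, which the listing does not confirm. The function name `summarize_records` and the sample field names are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def summarize_records(lines: Iterable[str]) -> Dict[str, object]:
    """Count rows and tally the value types seen for each field.

    `lines` is any iterable of JSON Lines text, such as an open file handle.
    Blank lines are skipped; lines that do not hold a JSON object are
    counted separately rather than raising.
    """
    rows = 0
    skipped = 0
    field_types: Dict[str, Counter] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if not isinstance(record, dict):
            skipped += 1
            continue
        rows += 1
        for key, value in record.items():
            field_types.setdefault(key, Counter())[type(value).__name__] += 1
    return {
        "rows": rows,
        "skipped": skipped,
        "fields": {key: dict(counts) for key, counts in field_types.items()},
    }


# Illustrative input only; these field names are made up, not taken from the dataset.
sample_lines = [
    '{"prompt": "Summarise the paper.", "answer": "The method works."}',
    '{"prompt": "Summarise the paper.", "answer": null}',
    "",
    "not json",
]
summary = summarize_records(sample_lines)
```

With a real file, the same function can be called as `summarize_records(open(path, encoding="utf-8"))`, where `path` points to whichever file was downloaded.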

Provenance

Source: Hugging Face
Collection Method: Likely generated for reinforcement learning training of a language model on academic text tasks.
Time Range: Not specified
Freshness: Last updated 2026-04-10 10:28:24; freshness should be verified.
Geography: Not specified

License is unknown; terms of use must be verified before application.
