This dataset contains results from evaluating weighted Best-of-N selection on the 500 math problems of the HuggingFaceH4/MATH-500 benchmark. It comes from an internship exercise exploring how test-time compute scaling with reward models can improve LLM performance. It was authored by cmpatino and last updated on April 23, 2026.
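For readers unfamiliar with the technique, the following is a minimal sketch of weighted Best-of-N selection as it is commonly described: sample N candidate solutions, score each with a reward model, sum the scores of candidates that share the same final answer, and return the answer with the highest total. The function names, the example answers, and the reward values are all hypothetical and are not taken from this dataset; the author's actual scoring and aggregation may differ.

```python
from collections import defaultdict


def weighted_best_of_n(answers, scores):
    """Pick the final answer whose candidates have the highest summed reward.

    `answers` holds the final answer extracted from each of the N samples and
    `scores` holds the matching reward-model scores (hypothetical inputs).
    """
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    if not answers:
        raise ValueError("need at least one candidate")

    # Accumulate the reward mass assigned to each distinct final answer.
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score

    # On a tie, max() keeps the answer that appeared first.
    return max(totals, key=totals.get)


def vanilla_best_of_n(answers, scores):
    """Baseline: return the answer of the single highest-scoring candidate."""
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    if not answers:
        raise ValueError("need at least one candidate")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return answers[best_index]


# Illustrative values only: four samples for one problem.
candidates = ["42", "41", "42", "7"]
rewards = [0.40, 0.70, 0.45, 0.10]

print(weighted_best_of_n(candidates, rewards))  # prints 42
print(vanilla_best_of_n(candidates, rewards))   # prints 41
```

In this toy case the two strategies disagree: the single best-scored sample answers "41", but the two samples answering "42" together carry more reward, so the weighted variant selects "42".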
License is unknown; users must verify permissions before use.