Name: MATH-500 Best-of-N Weighted Selection Results for LLM Evaluation
Creator: cmpatino
Published: 2026-04-23T10:26:25
Keywords: Math Benchmark, Benchmark, LLM Evaluation, Tabular, Test-Time Compute, Best-of-N, Reward Model

Description

This dataset contains results from evaluating Best-of-N weighted selection on 500 math problems from the HuggingFaceH4/MATH-500 benchmark. It was produced during an internship exercise exploring how test-time compute scaling with reward models can improve LLM performance. It was authored by cmpatino and last updated on April 23, 2026.
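
For readers unfamiliar with the technique, the following is a minimal sketch of weighted Best-of-N selection. It assumes the common definition: N candidate solutions are sampled per problem, a reward model scores each one, scores are summed across candidates that share the same final answer, and the answer with the largest total is selected. The function name, the (answer, score) pair layout, and the sample values are illustrative assumptions, not the dataset author's code or schema.

```python
from collections import defaultdict


def weighted_best_of_n(candidates):
    """Return the final answer whose candidates have the largest summed reward.

    `candidates` is a list of (final_answer, reward_score) pairs for one problem.
    This layout is an assumption made for illustration.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = defaultdict(float)
    for answer, score in candidates:
        # Identical final answers pool their reward scores.
        totals[answer] += score
    # Ties are resolved in favor of the answer seen first.
    return max(totals, key=totals.get)


# Hypothetical scores: "41" has the single best score, but "42" wins on total.
samples = [("42", 0.55), ("41", 0.90), ("42", 0.50), ("7", 0.10)]
selected = weighted_best_of_n(samples)
print(selected)
```

In this toy case, picking the single highest-scoring candidate would return "41", while pooling scores by answer returns "42". That difference is what distinguishes the weighted variant from plain Best-of-N.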

Use Cases

Benchmarking how much Best-of-N weighted selection improves LLM performance.
Analyzing the relationship between test-time compute scaling and accuracy on math problems.
Studying the effectiveness of reward models for guiding LLM generation on reasoning tasks.
Comparing different selection strategies for improving model outputs at inference time.
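
The comparison and scaling analyses listed above could be sketched roughly as follows. This is an illustration under stated assumptions: each problem is assumed to carry a reference answer and an ordered list of (final_answer, reward_score) candidates, and the three strategies shown (majority vote, plain Best-of-N, weighted Best-of-N) are common baselines rather than strategies confirmed to be in the dataset. All field names and values are hypothetical toy data chosen so the strategies disagree.

```python
from collections import Counter, defaultdict


def majority_vote(cands):
    # Most frequent final answer; reward scores are ignored.
    return Counter(answer for answer, _ in cands).most_common(1)[0][0]


def vanilla_best_of_n(cands):
    # Final answer of the single highest-scoring candidate.
    return max(cands, key=lambda c: c[1])[0]


def weighted_best_of_n(cands):
    # Final answer with the largest summed reward across matching candidates.
    totals = defaultdict(float)
    for answer, score in cands:
        totals[answer] += score
    return max(totals, key=totals.get)


def accuracy_at_n(problems, select, n):
    """Fraction of problems where `select`, given the first n candidates, matches the reference."""
    correct = sum(select(p["candidates"][:n]) == p["reference"] for p in problems)
    return correct / len(problems)


# Hypothetical toy data, not taken from the dataset.
toy_problems = [
    {"reference": "4",
     "candidates": [("5", 0.6), ("4", 0.5), ("4", 0.4), ("5", 0.1), ("5", 0.1)]},
    {"reference": "9",
     "candidates": [("9", 0.8), ("3", 0.9), ("9", 0.7), ("3", 0.2), ("9", 0.3)]},
]

strategies = {
    "majority": majority_vote,
    "best_of_n": vanilla_best_of_n,
    "weighted": weighted_best_of_n,
}
for n in (1, 3, 5):
    row = {name: accuracy_at_n(toy_problems, fn, n) for name, fn in strategies.items()}
    print(n, row)
```

Sweeping n while holding the candidate order fixed gives one accuracy-versus-compute curve per strategy. With real data, the candidate lists and scores would come from whatever columns the dataset actually exposes.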

Strengths

Based on a known benchmark of 500 math problems (MATH-500).
Created as part of a structured Hugging Face internship exercise, which suggests, but does not guarantee, methodological rigor.
Explicitly focuses on the emerging technique of test-time compute scaling with reward models.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count and file formats are unknown, which makes suitability hard to assess before download.
Description metadata is limited; actual data quality requires manual inspection after download.
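
Because field semantics have to be inferred after download, a quick schema summary is a reasonable first step. The sketch below assumes the records have already been loaded into a list of dictionaries (for example from a JSON Lines or Parquet export). The field names in the sample rows are hypothetical placeholders, since the real columns are undocumented.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted list of Python type names observed for it."""
    seen = defaultdict(set)
    for row in records:
        for name, value in row.items():
            seen[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items()}


# Hypothetical rows; replace with the records actually loaded from the dataset.
rows = [
    {"problem": "1+1", "n": 4, "selected_answer": "2", "is_correct": True},
    {"problem": "2*3", "n": 8, "selected_answer": None, "is_correct": False},
]
summary = summarize_fields(rows)
for field, types in summary.items():
    print(field, types)
```

Fields that show more than one type, such as a mix of strings and missing values, are worth checking by hand before any analysis.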

Provenance

Source: huggingface
Collection Method: Constructed by evaluating Best-of-N weighted selection on 500 problems from the HuggingFaceH4/MATH-500 benchmark.
Time Range: null
Freshness: Last updated 2026-04-23 13:06:53; freshness should be verified.
Geography: null

License is unknown; users must verify permissions before use.
