Name: AstralBench: 50 High-Quality Mathematical Problems for Model Benchmarking
Creator: nguyen599
Published: 2026-02-13T05:50:43
Keywords: Task Categories: text generation, Mathematical Reasoning, Library: polars, Language: en, Math Reasoning, Size Categories: n<1K, Modality: text, Mathematics, CSV, Library: mlcroissant, License: CC BY-SA 4.0, Library: datasets, Benchmark, Library: pandas, Problem Solving, Text, Region: us, Olympiad Math

Description

AstralBench is a curated collection of 50 mathematical problems for benchmarking AI model performance, selected from sources such as IMO AnswerBench, Project Euler, and Putnam. Created by nguyen599 and last updated on 2026-03-25, the dataset covers diverse topics and difficulty levels. Reported accuracy of current models on these problems ranges from 5% to 30%.

Use Cases

Benchmarking model accuracy on challenging mathematical problems, given the reported 5% to 30% accuracy range.
Evaluating model performance across the range of difficulty levels the description mentions.
Testing model generalization on problems drawn from multiple competitions and problem sets.
Analyzing failure modes in mathematical reasoning, given the low reported accuracy.
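
Benchmarking on a set like this usually reduces to exact-match accuracy. The sketch below assumes, hypothetically, that each problem has a single integer reference answer (consistent with the filtering described under Provenance) and that model outputs have already been parsed into integers; the function name `exact_match_accuracy` and its input shape are illustrative, not part of the dataset. It also reports the normal-approximation standard error, which matters at n = 50.

```python
import math
from typing import Optional, Sequence


def exact_match_accuracy(
    predictions: Sequence[Optional[int]],
    references: Sequence[int],
) -> tuple[float, float]:
    """Return (accuracy, standard error) for integer exact-match scoring.

    Hypothetical helper: a prediction of None (no parseable answer) counts as wrong.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    n = len(references)
    if n == 0:
        raise ValueError("need at least one problem")
    correct = sum(1 for p, r in zip(predictions, references) if p is not None and p == r)
    acc = correct / n
    # Standard error of a binomial proportion: sqrt(p * (1 - p) / n)
    se = math.sqrt(acc * (1 - acc) / n)
    return acc, se


# Illustrative usage: 50 problems, 15 answered correctly (30%).
refs = list(range(50))
preds = refs[:15] + [None] * 35
acc, se = exact_match_accuracy(preds, refs)
print(f"accuracy={acc:.2f} se={se:.3f}")
```

At 30% accuracy on 50 problems the standard error is about 6.5 percentage points, so small gaps between models on this set should be interpreted cautiously.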

Strengths

Contains 50 carefully curated problems, providing a focused benchmark set.
Covers diverse mathematical topics and difficulty levels, as stated in the description.
Sourced from established problem sets like IMO AnswerBench, Project Euler, and Putnam.

Limitations

Column definitions are not documented; the keywords list CSV as the file format and the title states 50 problems, but no schema confirms either, which limits suitability assessment.
Description metadata is limited; actual data quality and structure require manual inspection after download.
With only 50 problems, the dataset may not be representative of broader mathematical reasoning tasks.
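
Because the column layout is undocumented, a quick structural check after download is worthwhile. This minimal sketch uses Python's standard `csv` module and assumes only that the file is a CSV with a header row; the helper name `describe_csv` and the column names in the in-memory sample are hypothetical placeholders, not the dataset's actual schema.

```python
import csv
import io
from typing import IO


def describe_csv(stream: IO[str]) -> dict:
    """Summarize a CSV: header columns, data-row count, and empty cells per column."""
    reader = csv.DictReader(stream)
    columns = list(reader.fieldnames or [])
    empty = {c: 0 for c in columns}
    rows = 0
    for row in reader:
        rows += 1
        for c in columns:
            # DictReader fills missing trailing fields with None; treat "" as empty too.
            if row.get(c) in (None, ""):
                empty[c] += 1
    return {"columns": columns, "rows": rows, "empty_cells": empty}


# Demo on an in-memory sample with hypothetical column names.
sample = io.StringIO("id,problem,answer\n1,What is 2+3?,5\n2,Compute 7*6.,\n")
summary = describe_csv(sample)
print(summary)
```

For the downloaded file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `describe_csv`; a row count other than 50, or empty answer cells, would be worth investigating before running any evaluation.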

Provenance

Source: Curated from multiple sources including IMO AnswerBench, Project Euler, HMMT, SMT, USA-TSTST, USEMO, EGMO, CMO, Pumac, Putnam, open-rl, and mit-math.
Collection Method: Problems were selected from these sources, and those with non-integer or symbolic answers were filtered out.
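
The answer-type filter can be illustrated as a predicate that keeps only answers written as plain integers. This is a sketch under stated assumptions: answers are stored as strings, and "filtered" means such problems were removed. The curator's exact rule is not documented, and the names `is_integer_answer` and `keep_integer_problems` are hypothetical.

```python
import re
from typing import Iterable

# Hypothetical rule: after trimming whitespace, an answer counts as an integer
# if it is an optional sign followed by one or more digits.
_INTEGER = re.compile(r"[+-]?\d+")


def is_integer_answer(answer: str) -> bool:
    return _INTEGER.fullmatch(answer.strip()) is not None


def keep_integer_problems(problems: Iterable[dict]) -> list[dict]:
    """Drop problems whose 'answer' field is missing, non-integer, or symbolic."""
    return [p for p in problems if is_integer_answer(str(p.get("answer", "")))]


candidates = [
    {"id": 1, "answer": "2024"},
    {"id": 2, "answer": "-7"},
    {"id": 3, "answer": "3/4"},        # non-integer
    {"id": 4, "answer": "\\sqrt{2}"},  # symbolic
    {"id": 5, "answer": "n^2 + 1"},    # symbolic
]
print([p["id"] for p in keep_integer_problems(candidates)])
```

A looser rule (for example accepting `12.0` or a LaTeX-wrapped integer such as `\boxed{12}`) would change which problems survive, so treat this only as an illustration of the idea.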
Time Range: Not specified
Freshness: Last updated 2026-03-25 07:32:59; freshness should be verified.
Geography: Not specified

The description does not state a license, although the keywords list CC BY-SA 4.0; verify the terms of use before use.
