BakeAI's preview dataset contains 50 challenging university-level mathematics reasoning problems. Each problem includes a detailed reference solution, a structured grading rubric, and an anonymized model evaluation result.
Use Cases
- Benchmark model performance on multi-step reasoning problems using the structured grading rubric for evaluation.
- Analyze frontier model attempts against point-by-point grading criteria to identify failure modes in complex computation.
- Train or fine-tune models for proof construction tasks using the provided reference solutions as supervision.
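The rubric-based use cases above can be sketched as a small scoring helper. This is a minimal illustration, not the dataset's actual schema: the `RubricItem` class, its field names (`criterion`, `max_points`, `awarded`), and the sample rubric entries are all hypothetical and would need to be mapped onto whatever fields the released files use.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    criterion: str      # hypothetical field: what the grader checks
    max_points: float   # hypothetical field: points available for this criterion
    awarded: float      # hypothetical field: points the model earned


def problem_score(items):
    """Return the fraction of rubric points earned on one problem."""
    total = sum(item.max_points for item in items)
    if total == 0:
        raise ValueError("rubric has no points to award")
    return sum(item.awarded for item in items) / total


def lost_points_by_criterion(items):
    """List (criterion, points lost) for criteria not fully earned, largest loss first."""
    missed = [
        (item.criterion, item.max_points - item.awarded)
        for item in items
        if item.awarded < item.max_points
    ]
    return sorted(missed, key=lambda pair: pair[1], reverse=True)


# Made-up rubric for illustration only; not taken from the dataset.
example = [
    RubricItem("Sets up the integral correctly", 3, 3),
    RubricItem("Applies the substitution", 4, 2),
    RubricItem("States the final answer", 3, 0),
]

print(problem_score(example))
print(lost_points_by_criterion(example))
```

Averaging `problem_score` across problems gives a benchmark number, while `lost_points_by_criterion` points at the steps where a model attempt broke down.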
Strengths
- Contains 50 challenging, university-level problems designed for multi-step reasoning.
- Each entry includes a detailed reference solution and a structured point-by-point grading rubric.
- Provides anonymized model evaluation results against the rubric for benchmarking.
Limitations
- The small sample size (50 problems) limits statistical power for broad model evaluation.
- As a preview release, the dataset may be incomplete or a subset of a larger collection.
- Geographic and topical coverage may be narrow: the region is listed as US, and the focus is university-level math.
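To make the sample-size limitation concrete, the sketch below computes a 95% Wilson score interval for an accuracy measured on 50 problems. It assumes each problem is scored as simply correct or incorrect, which is a simplification of the point-by-point rubric; the function name and the 25-of-50 figure are illustrative, not results from the dataset.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical outcome: a model solves 25 of the 50 problems.
low, high = wilson_interval(25, 50)
print(f"{low:.3f} to {high:.3f}")
```

For 25 correct out of 50, the interval runs from about 0.37 to 0.63, so two models whose scores differ by a few problems cannot be reliably separated on this preview alone.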
Provenance
- Source: BakeAI
- Collection Method: not specified
- Time Range: not specified
- Freshness: not specified
- Geography: United States (based on the dataset's region tag, listed as US)