Description

QEDBench is a benchmark for evaluating large language models on formal proof generation and evaluation. It contains 272 proof-based problems spanning 10 distinct mathematical domains, created by researcher Quanquan C. Liu. The dataset was published in February 2026.

Use Cases

Benchmarking LLM performance on proof-based problems across 10 mathematical domains using the provided 272 problems.
Analyzing error patterns in formal proof generation by evaluating model outputs against the benchmark's proof standards.
Training or fine-tuning models for mathematical reasoning using the structured proof problems and their associated domains.
Studying the difficulty and distribution of formal reasoning tasks across different areas of mathematics defined in the benchmark.
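
For the benchmarking and per-domain analysis use cases above, a minimal sketch of aggregating graded model outputs by domain might look like the following. The field names (`problem_id`, `domain`, `correct`) and the domain labels are assumptions made for illustration; the actual QEDBench schema and domain names should be taken from the dataset page.

```python
from collections import defaultdict


def pass_rate_by_domain(results):
    """Return {domain: fraction of problems marked correct}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for row in results:
        totals[row["domain"]] += 1
        if row["correct"]:
            passed[row["domain"]] += 1
    return {domain: passed[domain] / totals[domain] for domain in totals}


# Hypothetical graded outputs, not real QEDBench data.
sample_results = [
    {"problem_id": "p1", "domain": "algebra", "correct": True},
    {"problem_id": "p2", "domain": "algebra", "correct": False},
    {"problem_id": "p3", "domain": "combinatorics", "correct": True},
]

rates = pass_rate_by_domain(sample_results)
for domain, rate in sorted(rates.items()):
    print(f"{domain}: {rate:.2f}")
```

Reporting a rate per domain rather than a single overall number keeps weak areas visible that an aggregate score would hide.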

Strengths

Contains 272 distinct proof-based problems.
Covers 10 different mathematical domains for breadth.

Limitations

Limited to 272 problems, which is a small sample for large-scale training.
Focus is on evaluation, not providing large volumes of training data.
Potential bias towards the specific proof styles and domains selected by the creators.
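
To make the small-sample limitation concrete for evaluation, the sketch below estimates how wide a 95% confidence interval on a measured pass rate would be, using the normal approximation to the binomial. The even split of about 27 problems per domain is an assumption (272 divided by 10); the source does not state the actual per-domain counts.

```python
import math


def ci_half_width(p, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a
    pass rate p measured on n problems (z=1.96 gives roughly 95%)."""
    return z * math.sqrt(p * (1 - p) / n)


# Worst case is p = 0.5, where the binomial variance is largest.
full_benchmark = ci_half_width(0.5, 272)
# Assumed even split across 10 domains; real counts may differ.
single_domain = ci_half_width(0.5, 27)

print(f"All 272 problems: +/- {full_benchmark:.3f}")
print(f"One domain (about 27 problems): +/- {single_domain:.3f}")
```

Under these assumptions, per-domain comparisons between models carry much more uncertainty than the overall score, so small per-domain gaps should be read with caution.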

Provenance

Source: Hugging Face dataset uploaded by author 'qqggez'.
Collection Method: Curated benchmark problems designed for LLM evaluation.
Time Range: Not specified.
Freshness: Last updated in March 2026.
Geography: Not specified.

The dataset's tags indicate a CC BY 4.0 license; confirm the exact terms on the dataset page. The benchmark is intended primarily for evaluation, not as a general-purpose training corpus.

Tags: text, llm-benchmark, mathematical-reasoning, formal-mathematics, arxiv:2602.20629, benchmark, license:cc-by-4.0, region:us, proof-evaluation

Mathematical Proof Evaluation Benchmark Across Ten Domains
