Mathematical datasets, statistical benchmarks, probability, optimization, operations research
2,487 datasets
A subset of the task data used to construct the SYNTHETIC-1 collection, created by PrimeIntellect and last updated in February 2025. It contains mathematical problems for text-based problem-solving tasks. The dataset is tagged for Mathematics, Text, and Synthetic Data.
The MATH dataset is a collection of mathematical problems for evaluating problem-solving capabilities. It was created by researchers including Dan Hendrycks and Collin Burns and published at NeurIPS in 2021. The dataset is hosted on Hugging Face by EleutherAI and was last updated in January 2025.
ProcessBench is a benchmark dataset proposed by the Qwen Team for evaluating the identification of process errors in mathematical reasoning. The dataset is hosted on Hugging Face and was last updated on December 27, 2024. The associated GitHub repository contains evaluation code and prompt templates used in the work.
Rhineland-Palatinate's official location register combines municipal lists from the state's Statistical Office and its Office for Surveying and Geo-Based Information. The dataset is a presentation service provided via WMS and was last updated on November 6, 2024. It is published by the Bundesamt für Kartographie und Geodäsie.
Rhineland-Palatinate's official register of municipalities and cities with fewer than 5,000 residents. The list is maintained by the Statistical Office of Rhineland-Palatinate and the State Office for Surveying and Geo-Based Information. It was last updated on November 6, 2024.
OpenLongCoT-Pretrain is a dataset referenced in the LLaMA-Berry research paper for pairwise optimization in mathematical reasoning. The dataset likely contains training examples aimed at achieving high-level mathematical problem-solving performance, as described in the associated arXiv preprint. It was uploaded to Hugging Face by the author di-zhang-fdu on October 28, 2024.
ARPA-E Grid Optimization Challenge 1 data from 2018-2019 provides synthetic power system network models for the Security Constrained AC Optimal Power Flow problem. The collection includes Real-Time and Online datasets with operating scenarios defining instantaneous power demand, renewable generation, and component availability. It was used for a competition requiring solvers to compute a base case operating point and verify feasibility across contingencies.
August 2023 Event 4 data includes 591 synthetic scenarios derived from 9 network models, totaling 3.6 GB. The dataset supports the ARPA-E Grid Optimization Competition Challenge 3, focusing on security-constrained optimal power flow problems for multiperiod dynamic markets. It contains results from 14 teams who solved 669 scenarios, with funding and prizes awarded across multiple competition events.
140,124 contest-level math problems formalized in the Lean 4 theorem prover, created by internlm and released in October 2024. The dataset includes natural language statements, answers, formal statements, and formal proofs where available. It is intended to support the training of autoformalization models and automated proof search.
Prooffol is a dataset uploaded to Hugging Face by author ramyakeerthyt on 2024-11-06. The title suggests it likely contains formal proofs or logical statements. The dataset's specific content, size, and structure require verification after download.
DeepSeek-Prover V1 contains between 10,000 and 100,000 synthetic mathematical proof records designed for the Lean proof assistant. Developed by deepseek-ai and released in 2024, this dataset facilitates the training and evaluation of large language models in formal mathematical reasoning.
A synthetically generated Chain of Thought (CoT) version of the TAT-QA arithmetic dataset, created by prompting Llama3 70B Instruct. The dataset was produced by Cerebras as part of their work on Cerebras DocChat, a document-based conversational Q&A model, to address arithmetic reasoning errors. It was last updated on August 19, 2024.
A collection of 29,000 theorems compiled from over 100 Lean 4 repositories. It was created by InternLM to support the development of theorem provers, including the fine-tuned 7B model InternLM2-Step-Prover.
Historical lottery draw results integrated with astronomical data, developed by szczyglis-dev and last updated in August 2024. The repository provides a Jupyter notebook demonstrating statistical analysis, linear regression, and visualization of number distributions.
A proof-of-concept collection of French question-context pairs designed for training and evaluating embedding models in the financial domain. The dataset was created by sujet-ai and last updated on July 28, 2024. It contains hand-selected examples from publicly available French financial documents.
PutnamBench comprises over 1,300 manual formalizations of problems from the William Lowell Putnam Mathematical Competition between 1965 and 2023. The benchmark supports three formal languages: Lean 4, Isabelle, and Coq. It was created by amitayusht and last updated on Hugging Face in June 2024.
TheoremQA is a dataset of 800 question-answer pairs created by human experts at TIGER-Lab. It covers over 350 theorems across mathematics, electrical engineering & computer science, physics, and finance. The dataset was uploaded to Hugging Face on May 15, 2024, and is intended as a benchmark for testing large language models on university-level problem-solving.
Six sets of Matlab files containing 50 samples each of partial discharge (PD) signals and corresponding voltage impulses, sampled at 20 GSps. The data was collected using a Vivaldi antenna in response to sudden voltage changes and is sorted by applied voltage amplitude. The dataset was authored by Juan Manuel Martínez-Tarifa and last updated in May 2024.
Nearly one million instructions in JSON format cover topics like calculus, probability, algebra, and trigonometry. The dataset was created by ajibawa-2023 and released on the Hugging Face platform, with a last recorded update in May 2024. It is structured for instruction tuning to support model development and research.
A 2024 dataset by Severino Fernández Galán presents a novel method for visualizing algebraic fractals. The method colors points in the complex plane based on the minimum modulus within their generated sequences, offering aesthetic views of prisoner sets. It was harvested from the e-cienciaDatos Dataverse platform.
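The minimum-modulus coloring mentioned in the last entry can be illustrated with a rough sketch. The listing does not say which maps the dataset uses, so the quadratic map z -> z*z + c, the starting value z_0 = c, the escape radius, and the function names below are all assumptions chosen for illustration, not the author's method.

```python
def min_modulus(c, max_iter=50, escape_radius=2.0):
    """Smallest |z_n| seen while iterating z -> z*z + c (assumed map), z_0 = c.

    Iterates that leave the escape radius end the sequence and are not counted.
    """
    z = c
    smallest = abs(z)
    for _ in range(max_iter):
        z = z * z + c
        r = abs(z)
        if r > escape_radius:
            break
        smallest = min(smallest, r)
    return smallest


def min_modulus_grid(width, height, re_range=(-2.0, 1.0), im_range=(-1.5, 1.5)):
    """Evaluate min_modulus on a width x height grid of the complex plane.

    Returns a list of rows; each value could be mapped to a color by a
    plotting library. Requires width >= 2 and height >= 2.
    """
    rows = []
    for j in range(height):
        im = im_range[0] + (im_range[1] - im_range[0]) * j / (height - 1)
        row = []
        for i in range(width):
            re = re_range[0] + (re_range[1] - re_range[0]) * i / (width - 1)
            row.append(min_modulus(complex(re, im)))
        rows.append(row)
    return rows


if __name__ == "__main__":
    grid = min_modulus_grid(7, 5)
    for row in grid:
        print(" ".join(f"{value:4.2f}" for value in row))
```

Points whose sequences pass close to the origin get small values, so mapping the value to a color shades the interior of the prisoner set rather than leaving it a single flat color.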