DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Mathematics & Statistics Datasets | DataSalon

📐

Mathematics & Statistics

Mathematical datasets, statistical benchmarks, probability, optimization, operations research

2,487 datasets

Math Olympiad Problem Formalizations for Theorem Proving

MathOlympiadBench contains human-verified formalizations of Olympiad-level mathematical competition problems. The dataset was created by Goedel-LM and introduced in a paper published in 2025. It sources problems from the Compfiles and IMOSLLean4 repositories.

Tags: Text, Mathematical Reasoning, Formal Verification, Theorem Proving, Competition Problems, +1 more

Citizen Appeals and Information Requests to the Chernihiv Region Tax Service

Statistical information on public information requests and citizen appeals processed by the Main Directorate of the State Tax Service in the Chernihiv region of Ukraine. The dataset includes reports on request satisfaction and work plan implementation, aggregated from the State site of Ukraine. It was last updated on August 4, 2025.

Tags: Text, Tabular, ZIP, CSV, Information Requests, Citizen Appeals, Tax Service, Public Administration, Government Reports, +1 more

MEDLINE/PubMed Baseline Statistics: Element Counts and Sizes, 2018-2023

MEDLINE/PubMed annual statistical reports detail the content and size of data elements in the baseline versions of the database for 2018-2023. The reports include total citations and occurrences per element, plus minimum, average, and maximum occurrences and lengths. The data is provided by datadiscovery.nlm.nih.gov and was last updated on June 18, 2025.

Tags: Tabular, CSV, XML, JSON, Medical Literature, Benchmark, Statistics, Metadata Analysis, Bibliometrics, +1 more

New York Licensed Home Care Agency Patient Cases and Discharges

Annual data from health.data.ny.gov reports statewide Licensed Home Care Services Agency (LHCSA) activity. It includes agency-level totals for patient counts, cases, and discharges for each reporting period. The dataset was last updated in July 2025.

Tags: Tabular, CSV, XML, JSON, Patient Census, New York State, Licensed Home Care Services Agency, Healthcare Statistics, Home Care, Home Care Services, Hdnyhcasr, LHCSA, +1 more

New York Licensed Home Care Service Cases by County and Type

Annual statistical reports detail the number of cases for 18 distinct home care services across New York State counties. The dataset is provided by the New York State Department of Health via health.data.ny.gov. Data was last updated in July 2025.

Tags: Tabular, CSV, XML, JSON, Healthcare Administration, New York State, Licensed Home Care Services Agency, County-Level Data, Home Care, Home Care Services, Hdnyhcasr, LHCSA, +1 more

AutoMathText: 200 GB of Mathematical Text from Web, arXiv, and GitHub

AutoMathText is a dataset of approximately 200 GB of mathematical text compiled from websites, arXiv, and GitHub, drawing on OpenWebMath, RedPajama, and Algebraic Stack. The dataset was created by author 'math-ai' and its associated work was accepted to ACL 2025 Findings. The dataset was last updated on July 16, 2025.

Tags: Text, task_categories:text-generation, Mathematical Reasoning, Mathematical Text, task_categories:question-answering, size_categories:1M<n<10M, language:en, Language Model Pretraining, arxiv:2402.07625, Text Generation, modality:text, license:cc-by-sa-4.0, Pretraining, Question Answering, Large Language Model, region:us, Reasoning, Finetuning, +1 more

NuminaMath-1.5: Strictly Filtered Mathematical Proof Problems

A strictly filtered subset of the NuminaMath-1.5 dataset containing only validated mathematical proof problems. The dataset was created by author 'nlile' and last updated on July 11, 2025. It applies multiple validation filters to the original data to ensure problem and solution validity.

Tags: Text, Mathematics, Proofs, Filtered Dataset, +1 more

MEDLINE/PubMed Baseline Statistics: Element Counts and Lengths, 2002-2023

MEDLINE/PubMed annual statistical reports detail the content and size of data elements within the biomedical literature database. The reports include counts of citations and element occurrences, plus minimum, average, and maximum values for occurrences and lengths per record. The data is provided by datadiscovery.nlm.nih.gov and was last updated on June 18, 2025.

Tags: Tabular, CSV, XML, JSON, Medical Literature, Benchmark, Statistics, Metadata Analysis, Bibliometrics, +1 more

MiroMind-M1: 719K Problems for Mathematical Reasoning Model Training

MiroMind-M1 is a fully open-source series of reasoning language models built on Qwen-2.5. The dataset contains 719,000 curated problems used for supervised fine-tuning, with an additional 62,000 challenging examples used for reinforcement learning. It was created by miromind-ai and last updated on July 22, 2025.

Tags: Text, Parquet, Mathematical Reasoning, library:polars, library:dask, Training Data, language:en, AI Evaluation, modality:text, size_categories:100K<n<1M, library:mlcroissant, library:datasets, Language Model, region:us, arxiv:2507.14683, license:apache-2.0, +1 more

MiroMind-M1 RL Training Performance on AIME24 and AIME25

Training performance data for the MiroMind-M1-RL-7B model on the AIME24 and AIME25 benchmarks. The dataset is associated with a model trained via reinforcement learning with verifiable rewards on 62,000 challenging examples. It is authored by miromind-ai and was last updated in July 2025.

Tags: Parquet, size_categories:10K<n<100K, library:polars, language:en, modality:text, library:mlcroissant, library:datasets, library:pandas, region:us, arxiv:2507.14683, license:apache-2.0, +1 more

Hazardous Materials Incident Reports by Type and Geography

Series of statistical reports on hazardous materials incidents compiled from the Hazardous Materials Incident Report Form 5800.1. The data is produced by the U.S. Department of Transportation and provides information on incidents by type, year, and geographical location. The latest update was in July 2025.

Tags: PHMSA, Hazardous, Hazardous Material, Incident, Report, Material, Hazmat, +1 more

DAFT Math: 199 Challenging Free-Response Problems for LLM Evaluation

199 challenging mathematical problems designed to be at the limit of current LLM abilities. The dataset, named DAFT Math (Difficult Automatically-scorable Free-response Tasks for Math), was created by metr-evals and last updated on July 17, 2025. It is presented as a research artifact for a niche use-case.

Tags: Text, CSV, library:polars, task_categories:question-answering, size_categories:n<1K, modality:text, Mathematics, modality:tabular, library:mlcroissant, library:datasets, Benchmark, library:pandas, Question Answering, LLM Evaluation, Free Response, region:us, Math, +1 more

Optimization Problem Instances For Operations Research

Instance datasets of operations research problems. The repository, created by Oscar-Oliveira, was last updated on September 9, 2025. It contains curated datasets for classic optimization problems.

Tags: Tabular, Facility Location, Operational Research, Operations Research, Research, Optimization Problems, Cutting And Packing, Cutting Packing, +1 more

Nemotron-Math-HumanReasoning: Olympiad-Level Math Solutions

NVIDIA released this dataset in July 2025 containing fewer than 1,000 human-written solutions to complex math problems. It features extended reasoning chains authored by Olympiad-level mathematics students to emulate the step-by-step logic of advanced reasoning models.

Tags: JSON, library:polars, size_categories:n<1K, modality:text, library:mlcroissant, library:datasets, library:pandas, license:cc-by-nc-4.0, region:us, arxiv:2507.09850, +1 more

MFR: Mathematical Formula Retrieval with 71 Core Identities

A collection of mathematical formula pairs classified for equivalence. The dataset is based on 71 famous mathematical identities. It was created by author 'ddrg' and last updated on July 8, 2025.

Tags: Text, Mathematics, Text Pairs, Formula Retrieval, Equivalence Classification, +1 more

U.S. Freight Analysis Framework Regions from 2017 Base Year

132 geographic zones define U.S. domestic freight regions, classified as Metropolitan Areas, Remainder of State areas, or Whole States. The dataset originates from the 2017 Commodity Flow Survey and was published by the Bureau of Transportation Statistics in April 2022.

Tags: Framework, Roads, Boundaries, Polygon, Atlas, Freight, Transportation, Analysis, Regions, Database, United States, Highway, Vector, National Transportation Atlas Database, Planning, National, NTAD, Network, +1 more

CombiBench: 100 Combinatorial Math Problems in Lean 4 Formal Language

CombiBench consists of 100 manually produced combinatorial mathematics problems encoded in the Lean 4 formal language. Developed by AI-MO and updated in July 2025, it serves as a specialized benchmark for assessing the reasoning capabilities of automated theorem proving systems.

Tags: Parquet, library:polars, size_categories:n<1K, modality:text, library:mlcroissant, library:datasets, library:pandas, region:us, license:mit, +1 more

DeepTheorem: Natural Language Theorem Proving Data for LLM Training

DeepTheorem is a framework for enhancing large language model mathematical reasoning through informal, natural language-based theorem proving. The dataset was created by Jiahao004 and last updated on Hugging Face on July 3, 2025. It introduces a novel approach to automated theorem proving by leveraging the informal reasoning strengths of LLMs.

Tags: Text, Mathematical Reasoning, LLM Training, Natural Language, Natural Language Processing, Theorem Proving, +1 more

AceReason-Math: 49,000 Verifiable Math Problems for Reasoning

NVIDIA's AceReason-Math contains 49,000 math problems and answers curated in June 2025 for training reasoning models via reinforcement learning. The collection is sourced from NuminaMath and DeepScaler-Preview, filtered to ensure all tasks are verifiable and text-based.

Tags: JSON, size_categories:10K<n<100K, task_categories:text-generation, library:polars, language:en, arxiv:2505.16400, modality:text, library:mlcroissant, library:datasets, library:pandas, license:cc-by-4.0, region:us, Reasoning, arxiv:2506.13284, Math, +1 more

Manufacturing Supply Chain Optimization Documents

SupplyChainOptimization is a collection of documents discussing manufacturing processes. The dataset contains texts on logistics management, cost reduction, and demand forecasting methods. It was created by infinite-dataset-hub and last updated in June 2025.

Tags: CSV, library:polars, size_categories:n<1K, modality:text, library:mlcroissant, library:datasets, library:pandas, region:us, Infinite Dataset Hub, license:mit, Synthetic, +1 more
