Mathematical datasets, statistical benchmarks, probability, optimization, operations research
2,487 datasets
Harvard Dataverse hosts measures of productive, unproductive, and net productive entrepreneurship for metropolitan statistical areas annually. The dataset, created by Gary Wagner, covers the period from 2002 to 2019. The accompanying academic paper details the variable construction methodology.
A collection of challenging mathematics problems designed for the AIMO (AI Mathematical Olympiad). The dataset contains text-based problems curated for educational and competitive training purposes. Its origin and specific size are not detailed in the provided metadata.
The Harmonized Tariff Schedule of the United States for 2025 provides applicable tariff rates and statistical categories for all merchandise imported into the country. It is maintained by the US International Trade Commission and is based on the international Harmonized System for global trade in goods. The dataset includes all revisions for the current year.
Dclm Baseline 1B is a 1 billion token sample created by codelion from the mlfoundations/dclm-baseline-1.0 dataset. It was generated using reservoir sampling, which draws a uniform random sample in a single pass over the data, so that the sample is statistically representative of the source's filtered, diverse web content. The dataset was last updated on November 2, 2025.
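For readers unfamiliar with reservoir sampling, the following is a minimal sketch of the classic single-pass algorithm (often called Algorithm R). It is illustrative only and is not the script used to build this dataset: the function name `reservoir_sample`, the fixed item count `k`, and the `seed` parameter are assumptions, and the actual sample was presumably drawn against a token budget rather than a fixed number of items.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration; not taken from the dataset's tooling.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 "documents" from a stream of 1,000 without holding them all in memory.
    docs = (f"doc-{n}" for n in range(1000))
    print(reservoir_sample(docs, k=5, seed=0))
```

Because each item is kept with equal probability regardless of its position in the stream, the sample stays uniform even when the source is too large to load or its length is not known in advance.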
An optimization model written in Pyomo allocates synchronous condensers at minimum cost to guarantee specified short-circuit current levels at transmission nodes. The model includes short-circuit current contributions from inverter-based resources under fault conditions. It was authored by Fatemeh Masoomi and last updated on December 21, 2025.
Sys2Rreasoning 125000 is a synthetic dataset containing algebra-heavy math reasoning problems, created by author xortron. Each entry includes fields for the question, problem, solution method, and answer. The dataset is tagged for fine-tuning and contains text-based word problems in English.
Three datasets contain reasoning and math problems paired with Chain-of-Thought (CoT) traces generated by Llama 3.1 8B Instruct. The collection includes step-level correctness annotations across arithmetic, boolean logic, and math domains to support the training of reasoning verifiers.
Sys2Rreasoning 100000 contains 100,000 synthetic algebra-heavy math reasoning problems. The dataset, created by xortron, includes fields for question, problem, how_to_solve, and answer.
Calls for Service data from the Howard County Police Department's computer-aided dispatch (CAD) system. The dataset includes event type, date, time, location, statistical reporting area (SRA), and beat for calls reported between 2014 and 2023. It is published by opendata.howardcountymd.gov and was last updated in October 2025.
AI-MO provides a structured collection of Olympiad problems and their official solutions. The dataset is organized by competition, such as the International Mathematical Olympiad, and includes raw PDF files. It was last updated on November 6, 2025.
Annual statistical reports from the Offices of the United States Attorneys contain national and district-level caseload data. The reports cover priorities in criminal prosecution and civil litigation, published by the U.S. Department of Justice.
A database supporting a study on a micro-solid phase extraction method for alkaloid detection. It includes UHPLC-MS parameters, validation results, matrix effect data, and AGREEprep scores for the methodology. The dataset, authored by Begoña Fernández-Pintor, was last updated on October 14, 2025.
Five macrolide antibiotics were analyzed in eggs at a concentration of 150 ng/g. The dataset includes results from a Box-Behnken experimental design, recovery percentages for functionalized and non-functionalized membranes, analytical performance metrics, and reproducibility studies. It was authored by Lorena González Gómez and last updated in October 2025.
Updated in December 2025 by bnicenboim, this repository hosts the datasets and models for the textbook "Introduction to Bayesian Data Analysis for Cognitive Science." It provides the specific experimental data and Bayesian modeling scripts necessary to follow the book's pedagogical examples in cognitive research.
IndustryOR provides 100 real-world operations research problems across five modeling types: linear programming, integer programming, mixed integer programming, non-linear programming, and others. CardinalOperations created this benchmark to train large language models for optimization modeling, with the dataset last updated in October 2025.
ATP Tennis Matches Dataset (2015-2025) provides detailed statistics for professional tennis matches over an 11-year period. Yahya777777 compiled this dataset by scraping the official ATP Tour website. The dataset was last updated in September 2025.
This dataset of ATP Tour tennis matches from 2015 to 2025 was compiled by Yahya777777. It contains detailed match statistics, player information, and tournament data scraped from the official ATP Tour website. The dataset covers an 11-year period of professional tennis.
23 specific alkaloids were simultaneously determined in infusions from dry edible flowers using a microextraction and UHPLC-IT-MS/MS method. The dataset includes RASFF system alerts, analytical parameters, method optimization and validation results, matrix effect and recovery percentages, and AGREEprep scores. Author Begoña Fernández Pintor published this data via the e-cienciaDatos Harvested Dataverse platform, with a last update timestamp of 2025-10-14.
This repository aggregates academic benchmark instances for cutting and packing optimization problems, maintained by the EURO Special Interest Group on Cutting and Packing (ESICUP). Updated in December 2025, the collection provides standardized data for testing algorithms across 2D, 3D, and nesting problem domains.
GSS nine-character codes for UK statistical geographies from 1 January 2009, including details of codes, relationships, and hierarchies. The database is provided as a zip file by the Government Digital Service and contains a snapshot as of December 2017. It is designed for use in conjunction with the Register of Geographic Codes.