Mathematical datasets, statistical benchmarks, probability, optimization, operations research
2,462 datasets
A collection of 308,000 reasoning traces distilled from the Hunter Alpha model via the OpenRouter platform. The dataset contains 1.2 billion tokens, was created by user 'ianncity', and was last updated in March 2026. It is composed of 30% math problems, 30% coding tasks, 15% science topics, 15% computer science, and 10% creative writing.
A 2000-row, 100-column subsample of the first-order-theorem-proving dataset, generated with a random seed of 4. The subset contains up to 10 target classes and was created using a stratified sampling method. Author Eddie Bergman released it under a US public domain license on the OpenML platform.
A subsampled version of the 'first-order-theorem-proving' dataset from OpenML, created by Eddie Bergman. The subset was generated with a random seed of 0, targeting a maximum of 2000 rows, 100 columns, and 10 classes, using stratified sampling. It likely contains tabular data related to automated theorem proving in first-order logic.
A 2000-row, 100-column subsample of the first-order-theorem-proving dataset, created with a specific random seed. The subset contains up to 10 target classes and was generated using a stratified sampling method. Eddie Bergman is listed as the author, and the data is released under a US public domain license.
first-order-theorem-proving_seed_1_nrows_2000_nclasses_10_ncols_100_stratify_True is a subsampled version of the first-order-theorem-proving dataset from OpenML. It was generated by Eddie Bergman using a script to uniformly sample rows, columns, and classes. The dataset likely contains tabular data related to automated theorem proving tasks.
2000 rows of data derived from the original first-order-theorem-proving dataset via a controlled subsampling process. The dataset was created by Eddie Bergman and is shared under a US public domain license on the OpenML platform. It is a tabular dataset likely used for machine learning tasks related to automated reasoning.
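The seeded, stratified row subsampling described in the entries above can be sketched as follows. This is a minimal illustration and not the script used to build these subsets: the function name `stratified_subsample`, the proportional-allocation rule, and the toy labels are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_rows, seed):
    """Pick row indices so that class proportions roughly match the input.

    Hypothetical helper: allocation is proportional with rounding, so the
    result can differ from n_rows by a few rows when quotas do not divide
    evenly.
    """
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    total = len(labels)
    if n_rows >= total:
        return list(range(total))

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        # Proportional quota, with at least one row kept per class.
        quota = max(1, round(n_rows * len(members) / total))
        quota = min(quota, len(members))
        chosen.extend(rng.sample(members, quota))
    return sorted(chosen)


# Toy data: 100 rows in three classes with a 60/30/10 split.
toy_labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
subset = stratified_subsample(toy_labels, n_rows=20, seed=4)
```

Running the function twice with the same seed returns the same indices, which is the property the `seed` field in these dataset names relies on.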
Great Britain's district-level election results are analyzed using a statistical model proposed by Jonathan N. Katz of California Institute of Technology. The model explains how results depend on economic conditions, ethnic composition, and campaign spending. It was used to estimate party-specific incumbency advantages, finding small but meaningful effects.
A multimodal Arabic mathematics dataset curated for supervised fine-tuning of vision-language models. Each sample pairs a geometric or algebraic diagram with an Arabic-language problem statement and its corresponding solution. The dataset was created by Omartificial-Intelligence-Space and last updated on Hugging Face in April 2026.
100 Monte Carlo simulations quantify the dispersion of the Integral of Time-weighted Absolute Error (ITAE) metric for PI-MPCC, SMSC-MPCC, and MPDSC motor controllers under ±50% parameter mismatch. This 5.5 KB Excel dataset, authored by Magdy Meawad, provides a compact benchmark for control system robustness analysis.
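The kind of analysis this entry describes can be sketched as below. The metric is the time-weighted integral of the absolute error, ITAE = ∫ t·|e(t)| dt, approximated here with the trapezoidal rule. Everything else is an assumption for illustration: the decaying-exponential error signal stands in for a real controller response, and `tau_nominal`, `toy_error_trace`, and `monte_carlo_itae` are hypothetical names. None of it reproduces the motor controllers or the values in the dataset.

```python
import math
import random
import statistics


def itae(times, errors):
    """Trapezoidal approximation of the integral of t * |e(t)| dt."""
    total = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        f_prev = times[k - 1] * abs(errors[k - 1])
        f_curr = times[k] * abs(errors[k])
        total += 0.5 * (f_prev + f_curr) * dt
    return total


def toy_error_trace(tau, dt=0.001, t_end=2.0):
    """Assumed stand-in for a tracking error: e(t) = exp(-t / tau)."""
    n = int(round(t_end / dt))
    times = [k * dt for k in range(n + 1)]
    errors = [math.exp(-t / tau) for t in times]
    return times, errors


def monte_carlo_itae(n_runs=100, tau_nominal=0.1, mismatch=0.5, seed=0):
    """Draw the parameter uniformly within +/- mismatch and collect ITAE."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        tau = tau_nominal * (1.0 + rng.uniform(-mismatch, mismatch))
        times, errors = toy_error_trace(tau)
        results.append(itae(times, errors))
    return results


values = monte_carlo_itae()
spread = statistics.stdev(values)  # dispersion of ITAE across the runs
```

For the toy signal the exact ITAE over an unbounded horizon is `tau ** 2`, which gives a quick sanity check on the numerical integration.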
Charts of the cohomology of the mod 2 Steenrod algebra up to total degree 261 are provided in multiple formats, including CSV and SQLite. The data includes an interactive plot of the E2 page of the Adams spectral sequence. Author Weinan Lin was supported by the China Postdoctoral Science Foundation and Peking University.
Replication data and code for the paper 'Carbon Pricing, Capital Bias, and Electric Vehicle Adoption' by Antweiler. The archive contains R code for simulating EV adoption and a Simulated Method of Moments estimator, along with Maple code for algebraic derivations. It was harvested from Borealis Dataverse and last updated on 2026-04-25.
Land-Form PANORAMA provides a digital terrain model (DTM) of the United Kingdom, derived from Ordnance Survey's 1:50,000 scale Landranger map contours. The dataset consists of a grid of height values at 50-meter intervals, interpolated from contour data with a vertical interval of 10 meters. It was created by EDINA and is based on source data from the Ordnance Survey.
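A regular height grid like the one described above is often queried at points that fall between grid posts. The sketch below shows one common way to do that, bilinear interpolation, on a tiny made-up grid. It is an assumption chosen for illustration, not the method used to derive the product from contours, and the function name, grid layout (row 0 at the southern edge), and sample heights are hypothetical.

```python
def bilinear_height(grid, x0, y0, cell, x, y):
    """Interpolate a height at (x, y) from a regular grid of posts.

    grid[row][col] holds heights in metres; column 0 sits at easting x0 and
    row 0 at northing y0, with posts spaced `cell` metres apart (assumed
    layout). Points outside the grid are extrapolated from the edge cell.
    """
    fx = (x - x0) / cell
    fy = (y - y0) / cell

    # Index of the lower-left post, clamped so col + 1 and row + 1 exist.
    col = min(max(int(fx), 0), len(grid[0]) - 2)
    row = min(max(int(fy), 0), len(grid) - 2)

    tx = fx - col  # fractional position inside the cell, west to east
    ty = fy - row  # fractional position inside the cell, south to north

    h00 = grid[row][col]
    h10 = grid[row][col + 1]
    h01 = grid[row + 1][col]
    h11 = grid[row + 1][col + 1]

    return (
        (1 - tx) * (1 - ty) * h00
        + tx * (1 - ty) * h10
        + (1 - tx) * ty * h01
        + tx * ty * h11
    )


# Made-up 2 x 2 grid with 50-metre spacing.
toy_grid = [[0.0, 10.0], [20.0, 30.0]]
centre_height = bilinear_height(toy_grid, x0=0.0, y0=0.0, cell=50.0, x=25.0, y=25.0)
```

At a grid post the function returns the stored height exactly; at the centre of a cell it returns the mean of the four surrounding posts.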
Physical-statistical models are applied to the Northeast Ice Stream in Greenland to understand the processes controlling rapid ice flow. The approach combines geophysical models with Bayesian statistical methodology and uses new remote-sensing observations. The project, conducted by SCIOPS, explores basal and surface elevations, velocity, and stress fields.
PARADIGM Benchmark Suite contains a sampled subset of 10 benchmarks used in the paper 'Select-then-Solve: Paradigm Routing as Inference-Time Optimization for Language Agents'. The dataset was created by author 'henggg' and last updated on April 9, 2026. It provides a fixed sample for evaluating six reasoning paradigms across four frontier LLMs.
Geoscience Australia Data presents a study applying Q-mode and R-mode factor analysis, discriminant analysis, and regression to geochemical data from Broad Sound, Queensland. The research classifies estuarine sediment samples into two geologically distinct groups and identifies processes controlling concentrations of P2O5, Cu, Pb, and Zn. The dataset underpinning the analysis is described in associated HTML and PDF documents.
GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. It was developed by OpenAI to support the task of question answering on basic mathematical problems that require multi-step reasoning. The problems are designed to be solved in 2 to 8 steps using basic arithmetic operations.
Code-Point with polygons provides notional boundary shapes for every postcode unit in Great Britain, including high-rise building floors. The dataset includes positional quality ratings, eastings and northings coordinates, and administrative codes such as NHS health authority codes. It is produced by the Government Digital Service.
Supplementary materials for a study titled 'Mechanistic insights into bluetongue virus immunodynamics: a bayesian within-host modeling approach'. The dataset, published on figshare by Abhijit Majumder, was last updated on 2026-05-07. It likely contains data related to parameter estimation diagnostics, convergence analysis, and vector transmission dynamics.
A 1:50,000-scale map estimates relative slope failure likelihood across 872 square kilometers at 30-meter resolution. Prepared in a GIS from a statistical model, it combines 120 geologic-map units, slope data, and inventories of 6,714 old landslide deposits and 1,192 post-1970 landslides. The model was developed to aid land-use and zoning decisions for metropolitan Oakland, California.
Arithmetic In The Wild Das is a dataset by author Tal535, hosted on Hugging Face. Its title suggests it contains mathematical problems or calculations collected from real-world or varied sources. The dataset was last updated on May 21, 2026, but specific content, size, and structure are not detailed in the available metadata.