Mathematical datasets, statistical benchmarks, probability, optimization, operations research
2,485 datasets
The Bureau of Transportation Statistics TransBorder Freight program provides U.S. cross-border freight data with Canada and Mexico. Data includes mode of transportation, commodity type, and geographic detail for exports and imports, used for trade corridor studies and infrastructure planning. BTS publishes a monthly statistical release highlighting key trends.
nimbleSCR provides utility functions, distributions, and fitting methods for Bayesian Spatial Capture-Recapture (SCR) and Open Population Spatial Capture-Recapture (OPSCR) modeling. The package, authored by Richard Bischof, is built using the nimble package and was motivated by the need for flexible and efficient analysis of large-scale SCR data.
Data sets and scripts for analyzing time series in both the frequency and time domains, including state space modeling. The collection supports the textbooks 'Time Series Analysis and Its Applications: With R Examples' (5th ed, 2025) and 'Time Series: A Data Analysis Approach Using R' (2nd ed, 2026). Most scripts are designed to require minimal input to produce aesthetically pleasing output for ease of use in live demonstrations and course work.
A 2015 software package created by Deepankar Datta to carry out Bland-Altman analyses, also known as Tukey mean-difference plots. The package was developed to address the lack of confidence interval calculations in existing functions and to create reproducible plots, with an available module for the 'jamovi' statistical spreadsheet.
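For readers unfamiliar with the method, the core Bland-Altman quantities can be sketched in a few lines. This is an illustrative sketch only, not the package's own code: the function name `bland_altman`, its return keys, and the sample measurements are hypothetical, and the confidence intervals use a normal quantile with the common approximation SE(limit) ≈ sqrt(3·s²/n) rather than an exact t-based calculation.

```python
import math
import statistics


def bland_altman(a, b, level=0.95):
    """Bias, limits of agreement, and approximate CIs for paired measurements.

    Hypothetical helper for illustration; not the API of any package.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    bias = statistics.mean(diffs)   # mean difference between methods
    sd = statistics.stdev(diffs)    # sample SD of the differences
    # Normal quantile (about 1.96 for 95%); a t quantile is more
    # appropriate for small n, so treat this as an approximation.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    lower, upper = bias - z * sd, bias + z * sd
    se_bias = sd / math.sqrt(n)
    se_loa = sd * math.sqrt(3 / n)  # approximate SE of each limit
    return {
        "bias": bias,
        "loa": (lower, upper),
        "bias_ci": (bias - z * se_bias, bias + z * se_bias),
        "lower_loa_ci": (lower - z * se_loa, lower + z * se_loa),
        "upper_loa_ci": (upper - z * se_loa, upper + z * se_loa),
    }


# Made-up paired readings from two measurement methods.
result = bland_altman([10, 12, 14, 16], [9, 12, 13, 17])
print(result["bias"], result["loa"])
```

A plot would place the pairwise means on the x-axis and the differences on the y-axis, with horizontal lines at the bias and the two limits.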
GPUMODE released this dataset in early 2026, containing between 100,000 and 1,000,000 GPU kernel submissions from the KernelBot competition platform. The collection focuses on optimized code specifically targeting AMD MI300 hardware and includes subsets for successful and deduplicated entries.
Kaggle hosts this dataset, which appears to be a benchmark for evaluating the Qwen3-1.7B language model. The title suggests it involves tasks combining summarization and arithmetic reasoning. The dataset's author, size, and specific contents are not detailed in the provided metadata.
This dataset comes from a 2012 controlled sub-seabed CO2 release experiment in Ardmucknish Bay, Scotland, which assessed impacts on sedimentary phosphorus. The study, published in the International Journal of Greenhouse Gas Control, found no statistically significant effects on solid-phase P content during the experiment. Laboratory analyses using the SEDEX sequential extraction technique revealed differences in P release potential among sediment types.
The dataset denotes boundaries for Community Development Block Grant (CDBG) Entitlement Communities and State Administered Non-Entitlement grantees. CDBG is a federal block grant distributed via formula to states and local governments for housing, economic development, and public improvement efforts serving low and moderate-income communities. The Department of Housing and Urban Development maintains this dataset, last updated on March 11, 2026.
This dataset features chunked content from 12 open-source mathematics textbooks, including 'An Infinitely Large Napkin' and 'Mathematical Reasoning: Writing and Proof'. It is intended for retrieval-augmented generation, embedding, and math reasoning research. The source code for the data pipeline is publicly available on GitHub.
This Kaggle-hosted dataset supports a statistical investigation of customer insights and likely contains data for analysis. Its specific origin and creation date are unknown, and the number of records and features is not specified in the available metadata.
Smart Tourism Service Quality Dataset is a collection of records related to tourism service optimization, likely gathered via Internet of Things (IoT) devices. The dataset is hosted on Kaggle, but details about its creator, size, and specific contents are not provided. Its structure and specific variables are unknown from the available metadata.
Bayesian hierarchical model data likely contains parameters, hyperparameters, or simulated observations for statistical analysis. The dataset is hosted on Kaggle, a platform for data science projects. Its specific source, size, and creation date are unknown.
This dataset links historical life trajectories from the Historical Sample of the Netherlands (HSN) for individuals born between 1812 and 1922 to contemporary outcomes in the System of Social statistical Datasets (SSD). It represents a Proof of Concept linkage, with a revised strategy successfully linking 77% of linkable HSN records. The linkage is based on matching birth dates of the individual, father, and mother, marriage date, and sex.
22,532 programming problems generated by AI, inspired by real scientific computing code snippets. Each problem is paired with a solution and focuses on concepts like numerical algorithms, data analysis, and mathematical modeling. The dataset was created by SciCode and was last updated on 2026-02-19.
Three national geological models covering Great Britain estimate the thickness of Quaternary and younger deposits. The British Geological Survey derived these 50 m x 50 m grids by interpolating borehole records and map data. Models provide indicative thickness values and proximity to source data for geohazard assessment.
15,000 multi-modal tensors combine CLIP embeddings with statistical features for deepfake detection. The dataset is optimized for direct training of machine learning models. The author, organization, and last update date are unknown.
MathVision-Latex pairs images of handwritten mathematical expressions with corresponding LaTeX code. The dataset appears designed for training models to recognize and transcribe mathematical handwriting. Its source and scale are not detailed in the provided metadata.
IOSR Journals presents a dataset from a paper analyzing numerical solutions to initial value problems for ordinary differential equations. The data likely contains results from solving several example problems using the Euler method, comparing approximate and exact solutions. The analysis investigates and computes error for different step sizes.
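The kind of comparison described above can be sketched as follows. This is a minimal illustration, not the paper's code or data: the test problem y' = y with y(0) = 1 (exact solution e^t), the step sizes, and the function names are all assumptions chosen for the example.

```python
import math


def euler(f, t0, y0, t_end, h):
    """Integrate y' = f(t, y) from t0 to t_end with fixed step h (forward Euler)."""
    n = round((t_end - t0) / h)  # number of steps; assumes h divides the interval
    t, y = t0, y0
    for _ in range(n):
        y += h * f(t, y)
        t += h
    return y


def rhs(t, y):
    # Assumed test problem: y' = y, whose exact solution is y(t) = e^t.
    return y


exact = math.exp(1.0)
step_sizes = (0.1, 0.05, 0.025)
errors = {h: abs(euler(rhs, 0.0, 1.0, 1.0, h) - exact) for h in step_sizes}

for h in step_sizes:
    print(f"h = {h:<6} error at t = 1: {errors[h]:.5f}")
```

Because forward Euler is first-order accurate, halving the step size should roughly halve the error at a fixed end point, which is the pattern such a step-size study would be expected to show.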
Caner Aktas provides S4 classes and methods for reading and manipulating aligned DNA sequences. The package supports indel-coding, shows base substitutions and indels, calculates pairwise distances, and collapses sequences into haplotypes. It also includes methods for estimating genealogical relationships among haplotypes using statistical parsimony and plotting parsimony networks.
rdrobust is a package for statistical inference in regression-discontinuity (RD) designs, a quasi-experimental method popular in social, behavioral, and natural sciences. It provides tools for point estimation, robust confidence intervals, bandwidth selection, and exploratory data analysis in Sharp, Fuzzy, and Kink RD settings. The package was authored by Sebastian Calonico.