General ML benchmarks, tabular data, AutoML, recommendation systems, anomaly detection, evaluation suites
143,570 datasets
geoBoundaries provides standardized, open-license administrative boundaries for Tonga. The dataset includes ADM0 (country), ADM1, and ADM2 level boundaries, produced and maintained since 2017. It is part of the geoBoundaries Global Database of Political Administrative Boundaries.
A novel topological feature engineering model achieved 87.4% accuracy, 84.1% sensitivity, and 89.6% specificity for predicting post-operative mortality in lung transplant recipients. The model, developed by Alexy Tran-Dinh and published on figshare in 2026, integrates static and time-dependent clinical variables to outperform traditional risk scores. It demonstrated an absolute AUC gain of 0.08 over the best non-topological baseline.
Bolivia's administrative boundaries from national to local levels (ADM0 to ADM3). The geoBoundaries Global Database has produced and maintained this standardized, open-license resource since 2017. It is available in GeoJSON and SHP formats under an ODbL-1.0 license.
A synthetic dataset simulating candidate profiles for technology jobs in a Brazilian context, designed for benchmarking fairness-aware algorithms. It contains nine partitions combining three sizes (1k, 5k, 10k instances) and three bias conditions (debiased, biased, extreme bias). The dataset was authored by Carvalho and last updated on 2026-05-11.
562 adult patient records from a retrospective cohort study conducted between January 2022 and January 2024. The dataset, created by Mingrui Zhao, was used to develop and compare logistic regression, random forest, and deep learning models for predicting relapse within 12 months.
562 adult patient records were used to develop and compare machine learning models for predicting relapse in idiopathic nephrotic syndrome. The deep learning model achieved the best performance, with a test AUC of 0.883. The dataset, created by Mingrui Zhao and last updated in April 2026, includes baseline clinical and laboratory variables.
125 patient records from a retrospective study of triple-negative breast cancer patients who underwent preoperative multiparametric MRI. The dataset likely contains radiomics features extracted from whole tumors and intratumoral habitat subregions, used to build predictive models for axillary lymph node metastasis. The data was authored by Bo Xie and uploaded to figshare in May 2026.
A prospective study of 100 Indian women with gestational diabetes, followed for up to 12 months postpartum, with 42% developing type 2 diabetes. The dataset, authored by Puja Chebrolu and last updated in April 2026, likely contains clinical measurements such as the insulinogenic index and Matsuda index taken at 6 weeks postpartum to assess associations with diabetes development.
47,828 tree species are represented, with data for 18 traits, including wood density and leaf area. The data, created by Daniel Maynard, contains both observed and imputed values from a 2022 study and was last updated in April 2026. It supports a consensus clustering method for classifying species into functional groups while accounting for trait uncertainty.
237 patients with granulomatous lobular mastitis or breast cancer underwent preoperative ultrasound examinations at Quzhou People's Hospital between April 2013 and April 2023. Radiomic features were extracted from the images to build interpretable machine learning models for preoperative differentiation. The dataset likely contains the extracted radiomic features and clinical predictors used in the study.
237 patients underwent preoperative breast ultrasound examinations at Quzhou People’s Hospital between April 2013 and April 2023. Radiomic features were extracted from ultrasound images to develop a machine learning model for distinguishing granulomatous lobular mastitis from breast cancer. The dataset includes 1,161 radiomic features per image, with a combined model achieving an AUC of 0.935 in the training cohort.
Processed HiChIP chromatin interaction data identifies loops at FDR < 0.01 using multiple methods. The dataset contains results for GM12878 lymphoblastoid cells (H3K27ac and cohesin targets) and K562 erythroleukemia cells (H3K27ac target). Authored by Weiyue Ding and last updated in May 2026, it provides 158.1 MB of processed analysis results derived from raw sequencing data.
A 9.5 KB Excel file lists standard heterosis percentages for the top 20 maize genotypes for grain yield. The dataset includes a summary of genotypes showing positive and negative heterosis under optimal and drought conditions, estimated relative to the best check and the mean of checks. It was authored by Goshime Muluneh Mekasha and last updated on June 3, 2026.
Cluster-wise mean and standard deviation values for grain yield, phenological, agronomic, and plant architectural traits of maize genotypes. The data is derived from K-means clustering applied to standardized multi-trait phenotypic data. It was authored by Goshime Muluneh Mekasha and last updated on June 3, 2026.
A dataset of 1,271 companies listed on Korea's KOSPI and KOSDAQ markets from 2016 to 2023. It was created by Xiao Wang to develop an AI model for predicting corporate management performance. The data combines financial variables with strategic indicators derived from text mining CEO messages in sustainability reports.
1,271 listed companies from Korea's KOSPI and KOSDAQ markets form a dataset for predicting corporate management performance. The data combines financial variables with strategic indicators derived from text mining CEO messages in sustainability reports, covering the period from 2016 to 2023. Author Xiao Wang published this dataset on figshare in May 2026 under a CC-BY-4.0 license.
This dataset contains information on 1,271 listed companies from Korea's KOSPI and KOSDAQ markets from 2016 to 2023, used to predict corporate management performance. It was created by Xiao Wang and combines financial variables with strategic indicators derived from text mining CEO messages in sustainability reports. The dataset, last updated in May 2026, is provided as an Excel file under a CC-BY-4.0 license.
88 German cities with populations over 100,000 are analyzed to explain OpenStreetMap updating dynamics. The dataset and code support a framework integrating XGBoost, SHAP explainability, and clustering to identify non-linear interactions between urban characteristics. Author Chuan Chen published the repository on figshare under a CC-BY-4.0 license, last updated on 2026-05-10.
Australia's ocean region from 70°E to 170°W and 20°N to 70°S is covered by this sea surface temperature (SST) product. It is a Level 3C (L3C) single-day average derived from daytime passes of the AVHRR instrument on the NOAA-19 satellite, gridded at 0.02-degree resolution. The Australian Ocean Data Network provides this data, with referenced accuracy metrics from 2014.
This dataset contains the winning numbers and prize amounts for the New York Lottery's Quick Draw game, beginning in 2013. The data includes details for each draw, such as the date, time, and number sequence. The columns suggest it can be used to analyze draw frequency, prize distribution, and number patterns.