General ML benchmarks, tabular data, AutoML, recommendation systems, anomaly detection, evaluation suites
141,962 datasets
Haoran Zhu published comparative experimental results for a Heterogeneous Biological Graph Convolutional Network (HBGCN) on 2026-05-19. The dataset, available on figshare, supports the evaluation of a method for drug-target interaction prediction. The source code and dataset are hosted on GitHub.
A list of ten predicted drug candidates for autistic disorder generated by a Heterogeneous Biological Graph Convolutional Network (HBGCN). The model integrates multimodal biological information to predict drug-target interactions. The dataset was created by Haoran Zhu and last updated on 2026-05-19.
A 5.5 KB Excel file containing a list of ten genes predicted to interact with the drug Tamoxifen. The data was generated by the Heterogeneous Biological Graph Convolutional Network (HBGCN) model, authored by Haoran Zhu and last updated on May 19, 2026. The model integrates multimodal biological information to predict drug-target interactions.
A list of ten proteins predicted to interact with the antipsychotic drug Clozapine, generated by the Heterogeneous Biological Graph Convolutional Network (HBGCN) model. The dataset is provided by author Haoran Zhu and was last updated on May 19, 2026. It is a small dataset, 5.5 KB in size, shared under a CC-BY-4.0 license.
Zeshan Ali's research dataset from 2026 investigates the thermal stability of a fermented beverage. The dataset likely contains measurements of phytochemical content, antioxidant activity, sensory scores, and volatile profiles for a beverage stored at 25°C, 40°C, and 50°C for 2 months. Random Forest modeling and kinetic analysis were applied to predict shelf-life.
A study investigates the thermal stability of a vinegar-based beverage formulated with red date vinegar, goji berry juice, and honey. The dataset includes results from phytochemical analysis, antioxidant activity tests, sensory evaluation, and electronic nose volatile profiling across three storage temperatures over two months. It was authored by Zeshan Ali and last updated on 2026-05-18.
Historical data from the Sistema de Información Red de Desaparecidos y Cadáveres (SIRDEC) on persons reported missing in Colombia from 1930 to April 2026. The dataset includes columns for demographic details, disappearance context, and geographic location. It was last updated on June 4, 2026, via the datos.gov.co platform.
Nathan Fortier provides performance metrics comparing the original SpliceAI tool with two open-source implementations and a legacy ensemble baseline. The dataset includes results from six evaluation sets, including a curated set of 1,316 validated variants and ClinVar-derived datasets comprising over 111,000 variants. It was last updated on May 13, 2026.
Performance metrics for six benchmark datasets used to evaluate SpliceAI and its open-source implementations. The data includes results from 1,316 validated variants, 213 variants with splice-assay data, 99,601 variants from the SPiP study, 242 deep intronic pathogenic variants, and two ClinVar-derived datasets totaling over 111,000 variants. Authored by Nathan Fortier and last updated in May 2026, the dataset is shared under a CC-BY-4.0 license.
Performance metrics compare the original SpliceAI tool with two open-source implementations and a legacy ensemble baseline across six variant datasets. The data includes results for 1,316 validated variants, 213 splice-assay variants, 99,601 variants from the SPiP study, 242 deep intronic pathogenic variants, and over 111,000 ClinVar-derived variants. Nathan Fortier published this benchmark on figshare in May 2026.
A curated set of 1,316 validated variants and five other datasets totaling over 200,000 variants were used to benchmark SpliceAI and its open-source reimplementations. The dataset, created by Nathan Fortier and last updated in May 2026, compares the original SpliceAI with OpenSpliceAI, CI-SpliceAI, and a legacy ensemble baseline. It includes performance metrics such as balanced accuracy and splice-site match rates across different variant classes.
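For readers unfamiliar with balanced accuracy, the metric named in this entry, a minimal sketch follows. It assumes binary labels (1 = splice-altering, 0 = not) and the standard definition as the mean of sensitivity and specificity; the function name and toy labels are illustrative and are not taken from the benchmark.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Hypothetical toy labels, not benchmark data.
toy_true = [1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 1]
toy_score = balanced_accuracy(toy_true, toy_pred)
```

Unlike plain accuracy, this weights both classes equally, which matters when pathogenic and benign variants are heavily imbalanced.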
Six datasets comprising 213,136 genetic variants were used to benchmark splice-altering variant prediction tools. The data compares the original SpliceAI algorithm against two open-source reimplementations and a legacy ensemble baseline. Authored by Nathan Fortier and last updated on 2026-05-13, the dataset is hosted on figshare under a CC-BY-4.0 license.
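The variant counts quoted across these SpliceAI entries can be cross-checked with simple arithmetic. The sketch below uses only the figures stated in the listings; the implied ClinVar count is a derived value under the assumption that the six sets do not overlap, and it is not a number reported by the source.

```python
# Counts for the four individually named evaluation sets, as listed above.
named_sets = {
    "validated variants": 1316,
    "splice-assay variants": 213,
    "SPiP study variants": 99601,
    "deep intronic pathogenic variants": 242,
}
reported_total = 213136  # total across all six datasets, as listed above

named_total = sum(named_sets.values())
# Assumption: the remainder belongs to the two ClinVar-derived datasets.
implied_clinvar = reported_total - named_total
print(named_total, implied_clinvar)
```

The remainder is consistent with the "over 111,000" ClinVar-derived variants described in the other entries.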
Baseline characteristics for 81,876 ICU admissions from the Medical Information Mart for Intensive Care (MIMIC-III/IV) and 140,237 ICU admissions from the eICU Collaborative Research Database. The data supports a study testing a deep-learning anomaly signal on creatinine-eGFR time series to predict near-term kidney replacement therapy and mortality. Author Yoonjin Kang published the data under a CC-BY-4.0 license in May 2026.
Supplementary File 1 (.docx) of "Integrating pretreatment CT radiomics and circulating tumor cells using machine learning to predict survival in hepatocellular carcinoma" describes a multimodal prognostic model for advanced hepatocellular carcinoma. The model, developed by Yongzhong Li, integrates clinical variables, CT radiomic features, and circulating tumor cell counts; the file was last updated on 2026-05-19. It includes internal and external validation results, reporting a concordance index of 0.789 and AUCs for 1-, 2-, and 3-year overall survival.
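The concordance index reported in this entry measures how often a model ranks patients correctly by risk. The listing does not say which variant the authors used; the sketch below assumes Harrell's C-index, and the function name and toy values are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) strictly before subject j's follow-up time.
    Higher risk is expected to mean shorter survival; ties in risk count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: follow-up time, event flag (0 = censored), risk score.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.4, 0.6, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the 0.789 quoted above.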
20 semesters of anonymized data on new student loan beneficiaries from ICETEX, covering periods from 2015-1 to 2025-1. The dataset includes columns for funding source, gender, education level, credit modality, and department of origin. It is published by www.datos.gov.co and was last updated on 2026-05-26.
Kelly Labart's guide from FERDI Research Data presents and critiques the major indicators of ethnolinguistic fragmentation, a characteristic frequently treated as an impediment to development. The guide addresses source and methodological problems associated with these indicators and questions their exogenous nature by exploring correlations with geography. The dataset is available as an Excel file and is published under an open license.
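The listing does not name the specific indicators the guide covers. As an illustration only, the sketch below computes the widely used Herfindahl-based fractionalization index, one minus the sum of squared group shares, which is assumed here to be representative of this family of indicators. The function name and group counts are hypothetical.

```python
def fractionalization_index(group_sizes):
    """Probability that two randomly drawn individuals belong to different groups.

    Computed as 1 - sum(share_i ** 2), where share_i is each group's
    proportion of the total population.
    """
    total = sum(group_sizes.values())
    if total <= 0:
        raise ValueError("total population must be positive")
    return 1.0 - sum((size / total) ** 2 for size in group_sizes.values())


# Hypothetical population counts for three linguistic groups.
toy_groups = {"group_a": 50, "group_b": 30, "group_c": 20}
toy_index = fractionalization_index(toy_groups)
```

The index is 0 for a fully homogeneous population and approaches 1 as the population splits into many small groups.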
Jiaru Liu's research dataset, last updated in May 2026, identifies a nine-gene lipid–immune signature linked to comorbidity between rheumatoid arthritis and major depressive disorder. The 2.4 MB file contains results from integrated transcriptomic analysis of five MDD and five RA cohorts, validated with in vitro experiments. The dataset is shared under a CC-BY-4.0 license on figshare.
GNATS provides an 8-year monthly time series of carbon cycle parameters across the Gulf of Maine. The dataset combines shipboard measurements of POC, PIC, DOC, primary productivity, and hydrographic data with concurrent satellite ocean color and SST observations. It is produced by NASA to study marine and terrestrial carbon fluxes in a productive shelf sea.
NASA's Crustal Dynamics Data Information System archives daily broadcast ephemeris files from a global network of ground receivers tracking multiple satellite constellations. Since 2011, the archive has expanded beyond GPS and GLONASS to include Europe's Galileo, China's BeiDou, and other global navigation systems. Each daily file contains broadcast navigation data in the standard RINEX format for a single ground station.
NASA's Crustal Dynamics Data Information System archives 1-second sampled, sub-hourly files of ground-based GNSS observation and broadcast ephemeris data from a global network. The dataset includes data from multiple global navigation systems, including GPS, GLONASS, Galileo, BeiDou, QZSS, IRNSS, and SBAS, with coverage expanding since 2011. Each file contains 15 minutes of data in the standard RINEX format from individual receiver sites.