Name: Enhancing Diabetes Risk Stratification Through NLP: A Multimodal Data Integration Approach
Creator: Yaoyan Lu
Published: 2026-05-28T06:18:35
License: CC-BY-4.0
Keywords: Clinical NLP, Predictive Modeling, Benchmark, Healthcare, Natural Language Processing, Time Series, Multimodal Health, Diabetes Risk, Multimodal

Description

A research dataset of 1,879 individuals, used to investigate whether integrating natural language processing (NLP) with traditional clinical data improves type 2 diabetes risk prediction. It pairs structured variables such as BMI and HbA1c with unstructured text entries such as symptom descriptions and lifestyle notes. It was created by Yaoyan Lu and last updated on 2026-05-28.

Use Cases

Training hybrid risk prediction models based on structured clinical variables and NLP-derived features.
Benchmarking NLP pipelines for extracting latent risk factors from unstructured clinical notes.
Evaluating model generalizability across different machine learning classifiers as described in the study.
Conducting temporal validation studies on clinical prediction models using a post-2020 cohort.
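The first and last use cases above can be sketched together: build one feature vector per individual from structured variables plus text-derived flags, then hold out later records for temporal validation. This is a minimal sketch under stated assumptions, not the study's pipeline. The field names (`bmi`, `hba1c`, `note`, `year`, `label`), the `RISK_TERMS` keyword list, and the 2020 cutoff rule are all hypothetical, since the dataset ships without column-level documentation.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical schema; the real column names must be checked after download.
    bmi: float
    hba1c: float
    note: str
    year: int
    label: int


# Hypothetical keyword list standing in for a real NLP feature extractor.
RISK_TERMS = ("thirst", "fatigue", "sedentary")


def text_features(note):
    """Return one binary flag per risk term found in the free-text note."""
    lowered = note.lower()
    return [1.0 if term in lowered else 0.0 for term in RISK_TERMS]


def hybrid_features(rec):
    """Concatenate structured variables with the text-derived flags."""
    return [rec.bmi, rec.hba1c] + text_features(rec.note)


def temporal_split(records, cutoff_year=2020):
    """Train on records up to the cutoff year, validate on later ones."""
    train = [r for r in records if r.year <= cutoff_year]
    test = [r for r in records if r.year > cutoff_year]
    return train, test
```

Keyword matching is only a placeholder for the text side; any extractor that maps a note to a fixed-length numeric vector could replace `text_features` without changing the rest of the sketch.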

Strengths

Dataset size of 1,879 individuals provides a substantive sample for model development.
Performance metrics for the integrated model are reported, including an AUC-ROC of 0.92 and accuracy of 88.2%.
Validation procedures were rigorous, including bootstrap confidence intervals, sensitivity analysis, and temporal validation on a separate cohort of 939 individuals.
The study tested generalizability across four different classifiers (logistic regression, random forest, XGBoost, neural networks).
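For readers who want to re-derive an interval around a reported AUC-ROC, a percentile bootstrap is one common approach. The sketch below is an illustration only: the source does not say which bootstrap variant, resample count, or confidence level the study used, so `n_boot`, `alpha`, and `seed` are assumed defaults. It uses the pairwise (Mann-Whitney) definition of AUC and the standard library alone.

```python
import random


def auc_roc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC; resamples individuals with replacement."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one class entirely
        stats.append(auc_roc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

The pairwise AUC is quadratic in cohort size, which is acceptable for a cohort of this scale but slow when repeated over many resamples; a rank-based implementation would be the natural replacement.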

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Cohort sizes are reported (1,879 individuals in the primary cohort, 939 in the temporal validation cohort), but the number of features or text entries per individual is not.
The 859.5 KB file appears to be a research paper (DOCX), so the underlying dataset files may not be included, or may be very small.
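Because the download appears to be a DOCX file, a quick structural check can show whether it carries any tabular data before further work is invested. A DOCX file is a ZIP archive whose main body lives in `word/document.xml`, so the standard library is enough for a first look. The sketch below is a hedged starting point; the function name `summarize_docx` is illustrative, and the actual file name is not given in the listing.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def summarize_docx(source):
    """Count tables and paragraphs and list embedded files in a DOCX archive.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
        root = ET.fromstring(archive.read("word/document.xml"))
    return {
        "members": names,
        "tables": len(list(root.iter(f"{{{W_NS}}}tbl"))),
        "paragraphs": len(list(root.iter(f"{{{W_NS}}}p"))),
        # Embedded spreadsheets or other objects, if any, sit under this folder.
        "embedded": [n for n in names if n.startswith("word/embeddings/")],
    }
```

A result with zero tables and no embedded files would support the reading that only the manuscript is included and the data must be located elsewhere.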

Provenance

Source: figshare
Collection Method: Analyzed a public dataset; specific original source not named.
Time Range: Includes a post-2020 validation cohort, suggesting data spans multiple years.
Freshness: Last updated 2026-05-28 06:18:35; freshness should be verified.

Format: DOCX, which most likely contains the research manuscript; the underlying dataset files may need to be located separately.
