Name: MARM: Malignancy Risk Prediction from Host CNV in Bronchoalveolar Lavage mNGS Data
Creator: Zhili Chang
Published: 2026-05-21T04:37:55
License: CC-BY-4.0
Keywords: Copy number variation, CSV, Benchmark, Healthcare, Tabular, Bronchoalveolar Lavage, Clinical Decision Support, Malignancy Risk Prediction

Description

MARM is a dataset supporting a framework for predicting malignancy risk from host-derived copy number variation (CNV) features extracted from bronchoalveolar lavage fluid (BALF) metagenomic next-generation sequencing (mNGS) data. The dataset was authored by Zhili Chang and last updated on 2026-05-21. It reports evaluation results for models such as XGBoost, Random Forest, and a generalized linear model (GLM); the best model achieved a sensitivity of 0.686, specificity of 0.975, accuracy of 0.847, and Youden index of 0.671 on an independent validation set.
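The Youden index is conventionally defined as sensitivity + specificity - 1, so the reported metrics can be cross-checked with a few lines of Python. The sketch below uses only the rounded values quoted in the description; the function name `youden_index` is illustrative and not part of the dataset.

```python
def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


# Rounded values quoted in the description.
j = youden_index(0.686, 0.975)
print(round(j, 3))  # 0.661
```

With these rounded inputs the result is 0.661 rather than the stated 0.671. The difference may come from a rounding or transcription issue in one of the published figures, so it is worth confirming the metrics against the source record before citing them.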

Use Cases

Training malignancy risk classifiers based on host-derived copy number variation (CNV) features.
Benchmarking machine learning models (XGBoost, Random Forest, GLM) on genomic features for clinical prediction.
Comparing the predictive performance of host-derived CNV signals versus microbial features in admixed mNGS data.
Developing pseudo-label extension strategies to incorporate weakly labeled samples into model training.
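As a rough illustration of the pseudo-label use case, the sketch below assigns labels to weakly labeled samples only when a model's predicted malignancy probability is confidently high or low. The function name and the 0.1 / 0.9 thresholds are assumptions chosen for the example; the card does not describe how MARM's own extension strategy works.

```python
from typing import List, Optional, Sequence


def pseudo_label(
    probs: Sequence[float], low: float = 0.1, high: float = 0.9
) -> List[Optional[int]]:
    """Turn predicted probabilities into pseudo-labels.

    Returns 1 (malignant) when p >= high, 0 (non-malignant) when p <= low,
    and None when the prediction is too uncertain to use for training.
    The thresholds are hypothetical and should be tuned on held-out data.
    """
    labels: List[Optional[int]] = []
    for p in probs:
        if p >= high:
            labels.append(1)
        elif p <= low:
            labels.append(0)
        else:
            labels.append(None)
    return labels


# Made-up probabilities for three weakly labeled samples.
print(pseudo_label([0.95, 0.50, 0.05]))  # [1, None, 0]
```

Samples that receive None would be left out of the extended training set, which keeps uncertain predictions from reinforcing the model's own errors.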

Strengths

Model performance metrics are explicitly provided, including a specificity of 0.975 and accuracy of 0.847.
The dataset is associated with a published methodological framework (MARM) with a clear technical description.
It is licensed under CC-BY-4.0, allowing for open reuse and modification.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes it hard to assess suitability for large-scale modeling before download.
The dataset is small (205.5 KB), which suggests limited scope; it likely contains model results or feature summaries rather than raw sequencing data.
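Because column semantics and row count are undocumented, a first pass after download can simply list the header, count rows, and count empty cells per column. The sketch below uses Python's standard `csv` module; the helper name `summarize_csv` and the in-memory sample are made up for illustration and do not reflect the dataset's actual columns.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def summarize_csv(handle: TextIO) -> Tuple[List[str], int, Dict[str, int]]:
    """Return (column names, data row count, empty-cell count per column)."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    row_count = 0
    empty = {name: 0 for name in columns}
    for row in reader:
        row_count += 1
        for name in columns:
            # Missing trailing fields come back as None; treat them as empty.
            if not (row.get(name) or "").strip():
                empty[name] += 1
    return columns, row_count, empty


# Hypothetical two-row file standing in for the real download.
sample = io.StringIO("sample_id,score\nS1,0.12\nS2,\n")
print(summarize_csv(sample))
```

For the real file, the same function can be called on a handle opened with `open(path, newline="")`, where `path` is wherever the downloaded CSV was saved.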

Provenance

Source: figshare, authored by Zhili Chang.
Collection Method: Host-derived CNV features were extracted from BALF mNGS data through genome-wide window-based coverage quantification, normalization, bias correction, and principal component-based denoising.
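To make the first two steps of the collection method concrete, the sketch below counts read start positions in fixed-size windows and converts the counts to median-normalized log2 ratios. It is a simplified illustration, not the author's pipeline: the window size, the pseudocount, and both function names are assumptions, and the bias-correction and principal component-based denoising steps are not shown.

```python
import math
from statistics import median
from typing import List, Sequence


def window_counts(positions: Sequence[int], chrom_length: int, window: int) -> List[int]:
    """Count 0-based read start positions in consecutive fixed-size windows."""
    n_windows = math.ceil(chrom_length / window)
    counts = [0] * n_windows
    for pos in positions:
        if 0 <= pos < chrom_length:  # ignore positions outside the chromosome
            counts[pos // window] += 1
    return counts


def log2_ratios(counts: Sequence[int], pseudocount: float = 0.5) -> List[float]:
    """Normalize window counts to the sample median and take log2.

    The pseudocount (an illustrative choice) avoids log2(0) in empty windows.
    """
    med = median(counts)
    return [math.log2((c + pseudocount) / (med + pseudocount)) for c in counts]


# Toy example: a 30 bp "chromosome" split into 10 bp windows.
counts = window_counts([0, 5, 10, 15, 25], chrom_length=30, window=10)
print(counts)  # [2, 2, 1]
print(log2_ratios(counts))
```

Windows with ratios near zero have coverage close to the sample median, while clearly negative or positive values are the kind of signal that downstream correction and denoising would refine before any copy number interpretation.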
Freshness: Last updated 2026-05-21 04:37:55; check the figshare record for newer versions before use.
