MARM is a dataset supporting a framework for predicting malignancy risk from host-derived copy number variation (CNV) features extracted from bronchoalveolar lavage fluid (BALF) metagenomic next-generation sequencing (mNGS) data. The dataset was authored by Zhili Chang and last updated on 2026-05-21. It contains results from evaluating models such as XGBoost, Random Forest, and GLM. On an independent validation set, the best model achieved a sensitivity of 0.686, a specificity of 0.975, an accuracy of 0.847, and a Youden index of 0.671.
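The four metrics quoted above are all functions of a binary confusion matrix. The sketch below shows the standard definitions. The counts in the example are hypothetical, since the validation confusion matrix is not given in this description, and the function names are illustrative only.

```python
def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def confusion_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Derive sensitivity, specificity, accuracy and Youden's J from counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "youden": youden_index(sensitivity, specificity),
    }


# Hypothetical counts, for illustration only (not taken from the dataset).
example = confusion_metrics(tp=7, fn=3, tn=18, fp=2)

# J recomputed from the rounded sensitivity and specificity quoted above.
recomputed_j = round(youden_index(0.686, 0.975), 3)
```

Recomputing J from the rounded sensitivity and specificity gives 0.661 rather than the reported 0.671, so the reported value is worth checking against the source files before it is cited.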
Use Cases
- Training malignancy risk classifiers on host-derived CNV features.
- Benchmarking machine learning models (XGBoost, Random Forest, GLM) on genomic features for clinical prediction.
- Comparing the predictive performance of host-derived CNV signals versus microbial features in admixed mNGS data.
- Developing pseudo-label extension strategies to incorporate weakly labeled samples into model training.
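The pseudo-label use case can be illustrated with a generic, single-round confidence-threshold scheme. This is a minimal sketch under stated assumptions, not the MARM procedure, which is not described here. `pseudo_label_extend` and the toy one-feature model `fit_centroid` are hypothetical names; in practice `fit` would wrap a real classifier such as XGBoost or a Random Forest.

```python
import math


def pseudo_label_extend(fit, X_labeled, y_labeled, X_weak, threshold=0.9):
    """One pseudo-labeling round.

    `fit(X, y)` must return a function mapping a feature vector to
    P(malignant). Weakly labeled samples are added to the training set
    only when the model is confident in either direction.
    """
    predict = fit(list(X_labeled), list(y_labeled))
    X_ext, y_ext = list(X_labeled), list(y_labeled)
    for x in X_weak:
        p = predict(x)
        if p >= threshold:          # confident positive
            X_ext.append(x)
            y_ext.append(1)
        elif p <= 1.0 - threshold:  # confident negative
            X_ext.append(x)
            y_ext.append(0)
        # otherwise the sample stays out of the training set
    return fit(X_ext, y_ext), X_ext, y_ext


def fit_centroid(X, y):
    """Toy stand-in model: logistic curve centred between the class means.

    Assumes one feature, both classes present, and distinct class means.
    """
    n_pos = sum(y)
    mean1 = sum(x[0] for x, t in zip(X, y) if t == 1) / n_pos
    mean0 = sum(x[0] for x, t in zip(X, y) if t == 0) / (len(y) - n_pos)
    mid = (mean0 + mean1) / 2.0
    scale = mean1 - mean0  # signed: positive if malignant scores higher
    return lambda x: 1.0 / (1.0 + math.exp(-4.0 * (x[0] - mid) / scale))


# Synthetic data, for illustration only.
X_lab = [[0.0], [0.2], [1.0], [1.2]]
y_lab = [0, 0, 1, 1]
X_weak = [[1.5], [0.6], [-0.3]]
model, X_ext, y_ext = pseudo_label_extend(fit_centroid, X_lab, y_lab, X_weak)
```

In this toy run the ambiguous sample at 0.6 sits on the decision boundary and is left out, while the two extreme samples are added with pseudo-labels. Passing `fit` as a callable keeps the loop independent of the underlying model.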
Strengths
- Model performance metrics are explicitly provided, including a specificity of 0.975 and accuracy of 0.847.
- The dataset is associated with a published methodological framework (MARM) with a clear technical description.
- It is licensed under CC-BY-4.0, which permits reuse and modification provided attribution is given.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- The row count is not stated, which makes it hard to assess suitability for large-scale modeling.
- The dataset is small (205.5 KB), which suggests limited scope: it likely contains model results or feature summaries rather than raw sequencing data.
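Because column names and row counts are undocumented, a first step after download is usually to inspect each file. The sketch below assumes the files are delimited text such as CSV or TSV, which is not confirmed by this description. The function name and the file name in the usage comment are hypothetical.

```python
import csv


def summarize_table(path, delimiter=","):
    """Return (column_names, row_count) for a delimited text file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])  # first line is assumed to be a header
        row_count = sum(1 for row in reader if row)  # blank lines skipped
    return header, row_count


# Usage (hypothetical file name):
# columns, n_rows = summarize_table("marm_results.csv")
```

For tab-separated files, pass `delimiter="\t"`.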
Provenance
- Source: figshare, authored by Zhili Chang.
- Collection Method: Host-derived CNV features were extracted from BALF mNGS data through genome-wide window-based coverage quantification, normalization, bias correction, and principal component-based denoising.
- Freshness: Last updated 2026-05-21 04:37:55; freshness should be verified.
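The collection method listed under Provenance can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the MARM implementation. The window size, pseudocount, median reference, and number of removed components are all assumptions, and the bias-correction step (for example GC or mappability correction) is omitted.

```python
import numpy as np


def window_counts(read_starts, chrom_length, window_size):
    """Count reads per fixed-size window (0-based starts < chrom_length)."""
    n_windows = int(np.ceil(chrom_length / window_size))
    idx = np.asarray(read_starts, dtype=np.int64) // window_size
    return np.bincount(idx, minlength=n_windows).astype(float)


def normalize(counts, pseudocount=0.5):
    """Depth-normalize a samples x windows matrix, then take the log2 ratio
    against the per-window median across samples."""
    counts = np.asarray(counts, dtype=float) + pseudocount
    frac = counts / counts.sum(axis=1, keepdims=True)
    log_frac = np.log2(frac)
    return log_frac - np.median(log_frac, axis=0, keepdims=True)


def pca_denoise(log_ratio, n_remove=1):
    """Remove the top principal components, assumed here to capture
    technical noise. The result is mean-centred per window."""
    centered = log_ratio - log_ratio.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    s = s.copy()
    s[:n_remove] = 0.0  # zero out the leading components
    return (u * s) @ vt


# Synthetic example: 6 samples, 20 windows of Poisson-distributed coverage.
rng = np.random.default_rng(0)
counts = rng.poisson(100, size=(6, 20))
log_ratio = normalize(counts)
denoised = pca_denoise(log_ratio, n_remove=1)
```

Whether the leading components reflect technical noise or real signal depends on the cohort, so the number of components to remove would need to be validated on real data.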