Name: ProbioSML: 1,072 Machine Learning-Derived Genomic Features from Probiotic Bacteria
Creator: Diego Lucas Neres Rodrigues
Published: 2026-04-22T05:46:21
License: CC-BY-4.0
Keywords: Machine Learning, Pangenomics, Microbial Genomics, Healthcare, Tabular, Probiotics, Large Scale, Genomic Features, Synthetic

Description

ProbioSML is a genomic dataset of 1,072 non-redundant protein-coding sequences derived from comparative analyses of bacterial genomes. Created by Diego Lucas Neres Rodrigues and released in 2026, it was generated by combining pangenomic analysis with supervised machine learning methods (Random Forest, Support Vector Machine, and Logistic Regression). It includes gene presence-absence matrices and functional annotations for taxa frequently reported as probiotics and for reference gut-associated bacteria.

Use Cases

Exploratory analysis of genomic patterns associated with probiotic taxa, using the 1,072 discriminative features.
Comparative genomics studies between probiotic and reference gut bacteria, using the provided gene presence-absence matrices.
Benchmarking machine learning methods for genomic feature extraction against the Random Forest, SVM, and Logistic Regression approaches used to build the dataset.
Investigating taxonomic and ecological signatures in bacterial genomes, using the dataset's functional annotations.
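
As a rough illustration of the exploratory and comparative use cases above, the sketch below ranks genes in a presence-absence matrix by how differently they occur in two groups of genomes. It is a simple frequency-gap heuristic, not the Random Forest, SVM, or Logistic Regression procedure used to build ProbioSML. The genome names, gene names, labels, and the helper functions `presence_frequency` and `rank_by_frequency_gap` are all hypothetical, since the dataset's actual column layout is not documented here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy presence-absence matrix: rows are genomes, columns are genes.
# 1 = gene present, 0 = gene absent. None of these names come from ProbioSML.
GENES = ["gene_A", "gene_B", "gene_C"]
MATRIX: Dict[str, List[int]] = {
    "probiotic_1": [1, 1, 0],
    "probiotic_2": [1, 0, 0],
    "reference_1": [0, 1, 1],
    "reference_2": [0, 1, 1],
}
# Hypothetical group label for each genome.
LABELS: Dict[str, str] = {
    "probiotic_1": "probiotic",
    "probiotic_2": "probiotic",
    "reference_1": "reference",
    "reference_2": "reference",
}


def presence_frequency(
    matrix: Dict[str, List[int]],
    labels: Dict[str, str],
    group: str,
    n_genes: int,
) -> List[float]:
    """Fraction of genomes in `group` that carry each gene."""
    rows = [row for name, row in matrix.items() if labels[name] == group]
    if not rows:
        raise ValueError(f"no genomes labelled {group!r}")
    return [sum(row[j] for row in rows) / len(rows) for j in range(n_genes)]


def rank_by_frequency_gap(
    matrix: Dict[str, List[int]],
    labels: Dict[str, str],
    genes: List[str],
) -> List[Tuple[str, float]]:
    """Rank genes by |freq(probiotic) - freq(reference)|, largest gap first."""
    prob = presence_frequency(matrix, labels, "probiotic", len(genes))
    ref = presence_frequency(matrix, labels, "reference", len(genes))
    gaps = [(gene, p - r) for gene, p, r in zip(genes, prob, ref)]
    return sorted(gaps, key=lambda item: abs(item[1]), reverse=True)


ranking = rank_by_frequency_gap(MATRIX, LABELS, GENES)
for gene, gap in ranking:
    # A positive gap means the gene is more common in the probiotic group.
    print(f"{gene}: {gap:+.2f}")
```

A ranking like this is only a first look at the data; a proper benchmark would compare it against a trained classifier on held-out genomes.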

Strengths

Contains 1,072 non-redundant protein-coding sequences, providing a specific set of discriminative genomic features.
Includes functional annotations and gene presence-absence matrices, adding interpretability to the genomic data.
All data and scripts are publicly available in an open-access repository, supporting reproducibility.
Released under a CC-BY-4.0 license, permitting broad reuse and modification.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment for certain modeling tasks.
At 167.3 KB the dataset is small, which suggests limited scope and a derived feature set rather than raw genomic sequences.
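
Because the row count and column semantics are undocumented, a first pass after download might look like the sketch below. It assumes the table has already been exported or converted to CSV; the sample content, the column names, and the helpers `summarize` and `looks_binary` are hypothetical and do not reflect the real ProbioSML files.

```python
import csv
import io

# Hypothetical CSV export; the real ProbioSML column names are undocumented.
SAMPLE = """feature_id,genome_1,genome_2,annotation
f0001,1,0,hypothetical protein
f0002,1,1,ABC transporter
f0003,0,0,
"""


def summarize(handle):
    """Return the row count and, per column, up to 5 distinct observed values."""
    reader = csv.DictReader(handle)
    columns = {name: set() for name in reader.fieldnames or []}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name, value in row.items():
            # Skip overflow fields (key None) and stop collecting after 5 values.
            if name in columns and len(columns[name]) < 5:
                columns[name].add(value or "")
    return n_rows, columns


def looks_binary(values):
    """True when every collected value is '0' or '1' (a presence-absence column)."""
    return bool(values) and values <= {"0", "1"}


n_rows, columns = summarize(io.StringIO(SAMPLE))
print("rows:", n_rows)
for name, values in columns.items():
    kind = "binary" if looks_binary(values) else "other"
    print(f"{name}: {kind}, examples={sorted(values)[:3]}")
```

To run this on a real file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open(path, newline="")`, where `path` points to the converted CSV.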

Provenance

Source: Diego Lucas Neres Rodrigues via figshare and Zenodo (DOI: 10.5281/zenodo.14181443).
Collection Method: Generated from comparative pangenomic analysis of bacterial genomes combined with supervised machine learning (Random Forest, SVM, Logistic Regression).
Time Range: Not specified
Freshness: Last updated 2026-04-22 05:46:21; freshness should be verified.
Geography: Not specified

The primary file format is PDF, so the contents may require extraction or conversion before direct computational use.
