Description
SOREL-20M Subset Dataset contains 196,534 samples for malware detection, created by reveng-grp-2025. It includes 99,506 malicious and 97,028 benign samples, each described by 2,351 EMBER v2 features. The dataset was last updated on Hugging Face on 2025-06-02.
Use Cases
Training binary classifiers for malware detection based on EMBER v2 features.
Benchmarking feature extraction and model performance on a large-scale malware dataset.
Analyzing the distribution and characteristics of malicious versus benign software samples.
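For the classifier-training use case, a held-out split that preserves the malicious/benign ratio is a common first step. The following is a minimal sketch using only the Python standard library. The function name `stratified_split`, the 20% test fraction, and the label encoding (1 = malicious, 0 = benign) are assumptions made for illustration; the card does not document how labels are stored.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each label's share in both parts."""
    rng = random.Random(seed)

    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical label vector built from the counts stated in the description.
# Assumed encoding: 1 = malicious, 0 = benign (not confirmed by the card).
labels = [1] * 99_506 + [0] * 97_028
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With real data, `labels` would be replaced by the dataset's actual label column once its name and encoding have been confirmed after download.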
Strengths
Contains 196,534 total samples, providing a substantial dataset for model training.
Includes a near-balanced split: 99,506 malicious (about 50.6%) and 97,028 benign (about 49.4%) samples.
Features are derived from the EMBER v2 feature set, a standard in static malware analysis.
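The class counts above can be turned into shares and a majority-class baseline, which gives a floor that any trained model should beat. This is a small arithmetic sketch based only on the counts stated in the description; the variable names are illustrative.

```python
# Counts as stated in the dataset description.
MALICIOUS = 99_506
BENIGN = 97_028
TOTAL = MALICIOUS + BENIGN

malicious_share = MALICIOUS / TOTAL
benign_share = BENIGN / TOTAL

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(malicious_share, benign_share)

print(f"total samples:     {TOTAL}")
print(f"malicious share:   {malicious_share:.2%}")
print(f"benign share:      {benign_share:.2%}")
print(f"majority baseline: {majority_baseline:.2%}")
```

Because the two classes are close in size, plain accuracy is a reasonable first metric here, although precision and recall on the malicious class are usually reported as well.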
Limitations
File size is unknown, which may limit suitability assessment; the sample count (196,534) is taken from the dataset description.
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
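Although the file size is not reported, a rough in-memory footprint can be estimated from the sample and feature counts in the description. The sketch below assumes the features are loaded as a dense numeric matrix; the storage format and dtype on disk are not documented, so both widths shown are assumptions, and the helper name `dense_size_bytes` is illustrative.

```python
# Counts as stated in the dataset description.
N_SAMPLES = 196_534
N_FEATURES = 2_351


def dense_size_bytes(n_rows, n_cols, bytes_per_value):
    """Size of a dense n_rows x n_cols matrix with fixed-width values."""
    return n_rows * n_cols * bytes_per_value


# Assumed dtypes: the card does not state how feature values are stored.
for name, width in (("float32", 4), ("float64", 8)):
    size = dense_size_bytes(N_SAMPLES, N_FEATURES, width)
    print(f"{name}: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```

The estimate covers the feature matrix only; labels, identifiers, and any compression applied to the files would change the actual download size.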
Provenance
Source
reveng-grp-2025 on Hugging Face.
Collection Method
Subset of the larger SOREL-20M dataset.
Freshness
Last updated 2025-06-02 11:38:12.
License is unknown; users must verify terms of use before downloading.