This dataset, published by the HuggingFace user puyang2025, is a unified, row-level email corpus built from seven commonly used public email datasets for phishing and spam detection research. Each row contains the email body text, optional header-like fields, a source dataset identifier, and a binary label. The dataset was last updated on HuggingFace in January 2026.
Use Cases
- Train binary classifiers to predict the `label` (phishing/spam vs. legitimate) using features from the `email body` text.
- Analyze stylistic patterns in the `email body` across different source `dataset_name` groups to identify dataset-specific biases.
- Build multi-source models that leverage the unified structure to improve generalization across the seven constituent datasets.
- Engineer features from the optional header-like fields (e.g., sender, date) to improve classification performance.
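One way to check the cross-source generalization mentioned above is a leave-one-source-out split: hold out all rows from one source dataset and train on the rest. The following is a minimal sketch on toy rows, not the dataset's own tooling. The column names `body`, `dataset_name`, and `label`, the source names, and the convention that `1` means phishing/spam are all assumptions made for illustration; check them against the actual schema before use.

```python
from collections import defaultdict

# Toy rows standing in for the real corpus. The column names and the
# source names ("source_a", ...) are hypothetical placeholders.
rows = [
    {"body": "Verify your account now", "dataset_name": "source_a", "label": 1},
    {"body": "Meeting moved to 3pm", "dataset_name": "source_a", "label": 0},
    {"body": "You won a prize", "dataset_name": "source_b", "label": 1},
    {"body": "Lunch tomorrow?", "dataset_name": "source_b", "label": 0},
    {"body": "Reset your password here", "dataset_name": "source_c", "label": 1},
]


def leave_one_source_out(rows, source_key="dataset_name"):
    """Yield (held_out_source, train_rows, test_rows) once per source."""
    by_source = defaultdict(list)
    for row in rows:
        by_source[row[source_key]].append(row)
    for held_out in sorted(by_source):
        test = by_source[held_out]
        # Training data is every row whose source differs from the held-out one.
        train = [r for s, rs in by_source.items() if s != held_out for r in rs]
        yield held_out, train, test


splits = list(leave_one_source_out(rows))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

A classifier that scores well on a random split but poorly on these held-out sources is likely relying on source-specific artifacts rather than on phishing or spam signals.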
Strengths
- Unifies seven distinct public email datasets into a single corpus for comparative analysis.
- Provides a consistent `label` field for binary classification tasks across all source data.
Limitations
- Per-source row counts and column details for the seven source datasets are not provided.
- The recency and geographic representativeness of the constituent email datasets are unknown.
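Since per-source row counts are not documented, they can be computed after loading the data. The sketch below tallies rows and the positive-label rate per source on toy data. The column names `dataset_name` and `label`, the source names, and the assumption that `1` marks phishing/spam are hypothetical and should be verified against the real schema.

```python
from collections import Counter

# Toy rows; column names and source names are hypothetical placeholders.
rows = [
    {"dataset_name": "source_a", "label": 1},
    {"dataset_name": "source_a", "label": 0},
    {"dataset_name": "source_b", "label": 1},
    {"dataset_name": "source_b", "label": 1},
    {"dataset_name": "source_b", "label": 0},
    {"dataset_name": "source_b", "label": 0},
]


def summarize_sources(rows, source_key="dataset_name", label_key="label"):
    """Return row count and positive-label rate for each source."""
    totals = Counter()
    positives = Counter()
    for row in rows:
        totals[row[source_key]] += 1
        # Assumes label 1 is the positive (phishing/spam) class.
        positives[row[source_key]] += int(row[label_key] == 1)
    return {
        source: {"rows": n, "positive_rate": positives[source] / n}
        for source, n in totals.items()
    }


summary = summarize_sources(rows)
for source, stats in sorted(summary.items()):
    print(source, stats["rows"], stats["positive_rate"])
```

Large differences in size or class balance between sources are worth noting before training, because a pooled model may otherwise be dominated by the largest source.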
Provenance
- Source: HuggingFace user puyang2025, aggregating seven unnamed public email datasets.
- Collection Method: Row-level unification of existing public datasets.
- Time Range: Not specified.
- Freshness: Last updated on the platform in January 2026.
- Geography: Not specified.