Name: Unified Corpus of Seven Phishing and Spam Email Datasets
Creator: puyang2025
Published: 2026-01-08T21:50:52
Keywords: Email Security, Text, Tabular, Natural Language Processing, Spam Classification, Phishing Detection, Text Corpus

Description

This corpus, published by the Hugging Face user puyang2025, unifies seven commonly used public email datasets for phishing and spam detection research into a single row-level table. Each row contains the email body text, optional header-like fields, a source dataset identifier, and a binary label. The dataset was last updated on Hugging Face in January 2026.

Use Cases

Train binary classifiers to predict the `label` (phishing/spam vs. legitimate) from features of the email body text.
Compare stylistic patterns in the email body across source `dataset_name` groups to identify dataset-specific biases.
Train on several sources at once and test whether the unified structure improves generalization across the seven constituent datasets.
Engineer features from the optional header-like fields (e.g., sender, date) to improve classification performance.
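
One way to probe dataset-specific bias and cross-source generalization is a leave-one-source-out split: hold out every row from one source, train on the rest, and repeat for each source. The sketch below is only an illustration under stated assumptions. The toy rows, the key name `body`, and the source names `source_a` to `source_c` are hypothetical, and the actual column names should be checked against the dataset before reuse.

```python
from collections import defaultdict

# Toy rows standing in for the corpus. The keys "body", "dataset_name",
# and "label" are assumed names; the real columns may differ.
rows = [
    {"body": "Verify your account now", "dataset_name": "source_a", "label": 1},
    {"body": "Meeting moved to 3pm", "dataset_name": "source_a", "label": 0},
    {"body": "You won a prize, click here", "dataset_name": "source_b", "label": 1},
    {"body": "Quarterly report attached", "dataset_name": "source_b", "label": 0},
    {"body": "Reset your password immediately", "dataset_name": "source_c", "label": 1},
]


def leave_one_source_out(rows, source_key="dataset_name"):
    """Yield (held_out_source, train_rows, test_rows) once per source."""
    by_source = defaultdict(list)
    for row in rows:
        by_source[row[source_key]].append(row)
    # Iterate sources in sorted order so the splits are reproducible.
    for held_out in sorted(by_source):
        train = [r for s, rs in by_source.items() if s != held_out for r in rs]
        yield held_out, train, by_source[held_out]


splits = list(leave_one_source_out(rows))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

A classifier that scores well on a random split but poorly on the held-out source is likely picking up source-specific artifacts rather than general phishing or spam cues.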

Strengths

Unifies seven distinct public email datasets into a single corpus for comparative analysis.
Provides a consistent `label` field for binary classification tasks across all source data.

Limitations

Row counts, exact column names, and per-source sample sizes for the seven source datasets are not provided.
The recency and geographic representativeness of the constituent email datasets are unknown.
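
Because per-source sizes are not documented, they can be computed after loading the data. The following is a minimal sketch, assuming rows are available as dictionaries with `dataset_name` and `label` keys and that a label of 1 marks phishing/spam. Both the key names and the label encoding are assumptions to verify against the actual data, and the toy rows are hypothetical.

```python
from collections import Counter

# Hypothetical toy rows; replace with the loaded corpus.
rows = [
    {"dataset_name": "source_a", "label": 1},
    {"dataset_name": "source_a", "label": 0},
    {"dataset_name": "source_b", "label": 1},
    {"dataset_name": "source_b", "label": 1},
    {"dataset_name": "source_b", "label": 0},
    {"dataset_name": "source_b", "label": 0},
]


def summarize_sources(rows, source_key="dataset_name", label_key="label"):
    """Return row count and share of positive labels for each source."""
    totals = Counter()
    positives = Counter()
    for row in rows:
        totals[row[source_key]] += 1
        # Assumption: label == 1 means phishing/spam.
        positives[row[source_key]] += int(row[label_key] == 1)
    return {
        source: {
            "rows": totals[source],
            "positive_share": positives[source] / totals[source],
        }
        for source in sorted(totals)
    }


summary = summarize_sources(rows)
for source, stats in summary.items():
    print(source, stats["rows"], stats["positive_share"])
```

Large differences in size or label balance between sources are worth knowing before pooling them, since a dominant source can mask weak performance on the smaller ones.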

Provenance

Source: Hugging Face user puyang2025, aggregating seven public email datasets that are not named in this description.
Collection Method: Row-level unification of existing public datasets.
Time Range: Not specified
Freshness: Last updated on the platform in January 2026.
Geography: Not specified
