136,414 snapshots of phishing and malware websites captured by a headless Chromium browser between July 24 and August 15, 2024. Each row contains the full HTML source, the extracted visible text, the complete network traffic from a HAR recording, parsed page features, and resource fingerprints. The dataset was created by nyuuzyou and was last updated on Hugging Face in March 2026.
Use Cases
- Train classifiers to detect phishing sites based on HTML structure and content.
- Analyze malicious network traffic patterns based on the HAR recordings.
- Build feature extraction pipelines for security alerting on top of the parsed page features.
- Study the evolution of phishing techniques based on snapshots from a specific time period.
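As a starting point for the network traffic use case, the sketch below counts requests per hostname in a single HAR document. It assumes the recording follows the standard HAR 1.2 layout (log.entries[].request.url) and is available as a JSON string; the name of the column that holds it is not documented, so the sample input is hand-made and the function name count_request_hosts is illustrative only.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def count_request_hosts(har_json: str) -> Counter:
    """Count requests per hostname in one HAR document (HAR 1.2 layout assumed)."""
    har = json.loads(har_json)
    entries = har.get("log", {}).get("entries", [])
    hosts = Counter()
    for entry in entries:
        url = entry.get("request", {}).get("url", "")
        host = urlsplit(url).hostname
        if host:  # skip entries without a parseable hostname
            hosts[host] += 1
    return hosts


# Hand-made HAR fragment for illustration; it is NOT taken from the dataset.
sample_har = json.dumps({"log": {"entries": [
    {"request": {"method": "GET", "url": "https://login.example.test/index.html"}},
    {"request": {"method": "GET", "url": "https://cdn.example.test/app.js"}},
    {"request": {"method": "POST", "url": "https://login.example.test/submit"}},
]}})

host_counts = count_request_hosts(sample_har)
print(host_counts.most_common())
```

Per-snapshot host counts of this kind can be aggregated across rows to look for hosting or exfiltration endpoints shared between phishing kits.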
Strengths
- 136,414 confirmed or high-confidence phishing/malware website snapshots.
- Contains multiple data modalities per snapshot: HTML, text, network traffic, and features.
- Snapshots were captured within a focused 23-day period in July-August 2024.
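The 23-day figure counts both the first and the last capture day. A quick check, which also derives a rough average capture rate (an average only; the actual per-day distribution is not documented):

```python
from datetime import date

start = date(2024, 7, 24)
end = date(2024, 8, 15)

# Inclusive day count: both endpoints are capture days.
inclusive_days = (end - start).days + 1

# Rough average, assuming nothing about how captures were spread over the period.
avg_per_day = 136_414 / inclusive_days

print(inclusive_days, round(avg_per_day))
```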
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Some hosts had already been blocked or taken down by the time of capture, so the affected snapshots may be incomplete.
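Because column-level documentation is absent, a first pass after download is usually to infer the schema from the rows themselves. The sketch below assumes the rows have already been loaded as a list of dictionaries; the field names in the sample (url, html, status) are hypothetical placeholders, not the dataset's actual columns.

```python
from collections import defaultdict


def summarize_fields(rows):
    """Report the observed value types and the null count for each field."""
    types = defaultdict(set)
    nulls = defaultdict(int)
    for row in rows:
        for name, value in row.items():
            if value is None:
                nulls[name] += 1
            else:
                types[name].add(type(value).__name__)
    fields = sorted(set(types) | set(nulls))
    return {
        name: {"types": sorted(types[name]), "nulls": nulls[name]}
        for name in fields
    }


# Hypothetical rows for illustration; real field names must be read from the data.
rows = [
    {"url": "http://a.example.test/", "html": "<html></html>", "status": 200},
    {"url": "http://b.example.test/", "html": None, "status": None},
]

summary = summarize_fields(rows)
for name, info in summary.items():
    print(name, info)
```

Fields with a high null count are a hint that the host was already unreachable at capture time, which ties back to the completeness caveat above.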
Provenance
- Source: Hugging Face
- Collection Method: Captured by a headless Chromium browser.
- Time Range: July 24 to August 15, 2024.
- Freshness: Last updated 2026-03-12 02:03:38; freshness should be verified.
- Geography: not specified