Name: Phishing and Legitimate Webpage Features for Machine Learning, 2015-2017
Creator: Shashwat Tiwari
License: CC-BY-4.0
Keywords: Machine Learning, Web Security, Cybersecurity, Benchmark, Tabular, Finance, Phishing Detection

Description

This dataset represents 10,000 webpages (5,000 phishing and 5,000 legitimate) by 48 extracted features. The webpages were downloaded in two periods: January to May 2015 and May to June 2017. The dataset was created by Shashwat Tiwari using an improved feature extraction technique that leverages the Selenium WebDriver browser automation framework. It originates from research by Tan, Choon Lin (2018) and is shared under a CC-BY-4.0 license.

Use Cases

Benchmarking phishing webpage classification models based on 48 extracted features.
Analyzing the effectiveness of different webpage features for phishing detection.
Conducting rapid proof-of-concept experiments for new anti-phishing techniques.
Evaluating feature extraction methods, such as browser automation versus parsing.
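
As a rough illustration of the feature-effectiveness use case, the sketch below ranks numeric feature columns by the absolute Pearson correlation between each feature and a binary class label. It assumes the data is available as a CSV with a header row; the column names (`feature_a`, `feature_b`, `label`) and the sample rows are made up for illustration and are not taken from the dataset. Correlation is only one of many possible effectiveness measures.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(csv_file, label_column):
    """Return (feature, |correlation with label|) pairs, strongest first."""
    rows = list(csv.DictReader(csv_file))
    labels = [float(row[label_column]) for row in rows]
    scores = {}
    for name in rows[0]:
        if name == label_column:
            continue
        values = [float(row[name]) for row in rows]
        scores[name] = abs(pearson(values, labels))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up sample standing in for the real file (hypothetical column names).
sample = io.StringIO(
    "feature_a,feature_b,label\n"
    "1,5,1\n"
    "1,3,1\n"
    "0,5,0\n"
    "0,3,0\n"
)
ranking = rank_features(sample, "label")
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

To apply this to the downloaded data, pass an opened file object and the actual label column name in place of the sample and `"label"`.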

Strengths

Contains a balanced set of 5,000 phishing and 5,000 legitimate webpage instances.
Features were extracted with the Selenium WebDriver framework, a technique the dataset description characterizes as improved and more precise.
Clear provenance linking to the original research dataset with a DOI.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Last update date is unknown; freshness unverified.
Row count is known, but specific file formats and sample data are unavailable for preview.
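
Because column semantics and file format are undocumented, a first step after download might be a quick structural check. The sketch below assumes the file turns out to be a CSV with a header row, which the description does not confirm. It lists the column names, counts rows, and tallies the values of a label column. The column names and sample rows are hypothetical placeholders, not values from the dataset.

```python
import csv
import io
from collections import Counter


def summarize(csv_file, label_column):
    """Report column names, row count, and per-class counts for a CSV file object."""
    reader = csv.DictReader(csv_file)
    columns = list(reader.fieldnames or [])
    label_counts = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        label_counts[row[label_column]] += 1
    return {
        "columns": columns,
        "rows": row_count,
        "label_counts": dict(label_counts),
    }


# Made-up sample standing in for the real file (hypothetical column names).
sample = io.StringIO(
    "feature_a,feature_b,label\n"
    "1,5,1\n"
    "0,3,0\n"
    "1,4,1\n"
    "0,2,0\n"
)
summary = summarize(sample, "label")
print(summary)
```

On the real file, the expectation from the description would be 10,000 rows split evenly between the two classes, with 48 feature columns plus the label.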

Provenance

Source: Tan, Choon Lin (2018), 'Phishing Dataset for Machine Learning: Feature Evaluation', Mendeley Data, V1, doi: 10.17632/h3cgnj8hft.1
Collection Method: Features extracted from downloaded webpages using Selenium WebDriver browser automation.
Time Range: Webpages downloaded from January to May 2015 and from May to June 2017.
Freshness: Data collection ended in 2017; last platform update date is unknown.
Geography: Likely global, but geographic coverage is not specified in the description.

Preprocessing: The 'id' column was removed from the dataset as irrelevant for analysis.
