Phishing and Legitimate Webpage Features for Machine Learning, 2015-2017
by Shashwat Tiwari
arff
Description
The dataset covers 10,000 webpages (5,000 phishing and 5,000 legitimate), downloaded from January to May 2015 and from May to June 2017, each represented by 48 extracted features. It was created by Shashwat Tiwari using an improved feature extraction technique based on the Selenium WebDriver browser automation framework. It originates from research by Tan, Choon Lin (2018) and is shared under a CC BY 4.0 license.
Use Cases
Benchmarking phishing webpage classification models based on 48 extracted features.
Analyzing the effectiveness of different webpage features for phishing detection.
Conducting rapid proof-of-concept experiments for new anti-phishing techniques.
Evaluating feature extraction methods, such as browser automation versus parsing.
Strengths
Contains a balanced set of 5,000 phishing and 5,000 legitimate webpage instances.
Features were extracted using an improved, precise technique based on the Selenium WebDriver framework.
Clear provenance linking to the original research dataset with a DOI.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Last update date is unknown; freshness unverified.
Row count is known, but sample data is unavailable for preview, and the only format indication on the listing is the arff tag.
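Because column-level documentation is absent, one way to get oriented after download is to read the attribute declarations in the ARFF header and count the class labels to confirm the 5,000/5,000 balance. The sketch below is a minimal, standard-library-only illustration. It assumes the file is a dense (non-sparse) ARFF file and that the class label is the last attribute. Neither assumption is confirmed by this listing, and the attribute names in the toy sample are invented for illustration.

```python
import csv
from collections import Counter


def inspect_arff(text, class_attribute=None):
    """Return (attributes, class_counts) for dense ARFF text.

    attributes is a list of (name, type) pairs read from the header.
    class_counts tallies the values of one column: the column named by
    class_attribute, or the last column if none is given (an assumption).
    """
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        if in_data:
            rows.append(line)
            continue
        keyword = line.split(None, 1)[0].lower()  # ARFF keywords are case-insensitive
        if keyword == "@attribute":
            rest = line.split(None, 1)[1].strip()
            if rest[0] in "'\"":  # quoted attribute name, may contain spaces
                end = rest.index(rest[0], 1)
                name, kind = rest[1:end], rest[end + 1:].strip()
            else:
                name, kind = rest.split(None, 1)
            attributes.append((name, kind.strip()))
        elif keyword == "@data":
            in_data = True

    names = [name for name, _ in attributes]
    idx = names.index(class_attribute) if class_attribute else len(names) - 1
    reader = csv.reader(rows, quotechar="'", skipinitialspace=True)
    # Rows with an unexpected number of fields are ignored rather than guessed at.
    counts = Counter(row[idx].strip() for row in reader if len(row) == len(names))
    return attributes, counts


# Toy input for illustration only; these names are NOT the real columns.
SAMPLE = """\
% Toy header, not the real file.
@relation toy_phishing
@attribute feature_a numeric
@attribute 'feature b' numeric
@attribute label {0,1}
@data
3,0.5,1
1,0.0,0
2,0.25,1
4,0.75,0
"""

attributes, counts = inspect_arff(SAMPLE)
for name, kind in attributes:
    print(name, kind)
print(dict(counts))
```

To run it on the downloaded file, read the file's text and pass it to `inspect_arff`. If the label is not the last attribute, pass its name as `class_attribute`.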
Provenance
Source
Tan, Choon Lin (2018), 'Phishing Dataset for Machine Learning: Feature Evaluation', Mendeley Data, V1, doi: 10.17632/h3cgnj8hft.1
Collection Method
Features extracted from downloaded webpages using Selenium WebDriver browser automation.
Time Range
Webpages downloaded from January to May 2015 and from May to June 2017.
Freshness
Data collection ended in 2017; last platform update date is unknown.
Geography
Likely global, but geographic coverage is not specified in the description.
Note: the 'id' column was removed from the dataset because it is irrelevant for analysis.