Source code corpora, bug reports, vulnerability databases, network intrusion detection, malware samples
1,591 datasets
The DDoS dataset is hosted on Kaggle and likely contains network traffic logs or features related to distributed denial-of-service attacks. The specific source, collection method, and data volume are not detailed in the available metadata.
A malware dataset published on Kaggle. The dataset's specific content, size, and provenance are not detailed in the provided metadata. Further inspection after download is required to confirm its structure and applicability.
IDS Dataset is a Kaggle-hosted collection likely containing network traffic or system logs for security analysis. The dataset's specific size, features, and creation details are not provided in the available metadata. Its content and structure require verification after download.
DatasetPhishing is a dataset hosted on Kaggle. Its specific contents, size, and origin are not detailed in the provided metadata. The dataset likely contains features related to phishing attempts, such as URLs, email headers, or website characteristics.
Phishing email data published on Kaggle. The dataset likely contains examples of deceptive emails used for security analysis. The specific number of emails, collection method, and author are unknown.
Datasetsphishings is a dataset hosted on Kaggle. Its title suggests it contains data related to phishing attacks, a common cybersecurity threat. The dataset's specific contents, size, and origin are not detailed in the available metadata.
A dataset related to phishing, sourced from Kaggle. The specific contents, size, and creation details are unknown from the provided metadata. Further details about the data's origin, time range, and specific features require verification after download.
TIGER-Lab's AceCoder-V2-122k dataset contains over 147,000 programming questions and test cases and is presented as an improvement over the V1 version. The dataset was created by rewriting questions and test cases using OpenAI's o1-mini and filtering them with Qwen2.5-Coder-32B-Instruct. It was last updated on August 14, 2025.
Two categories of URL strings, malicious and benign, facilitate the detection of harmful online content. The data supports binary classification tasks for web security applications by providing labeled web addresses.
A dataset for viral content propagation and botnet detection. The dataset is tagged as Education and Synthetic. No information is available on its size, structure, or provenance.
Labeled email text samples categorized into phishing and legitimate classes enable text-based security analysis. The data focuses on identifying malicious communication through linguistic patterns and message content. It supports the development of natural language processing models for cybersecurity.
HumanEvalPack extends OpenAI's HumanEval benchmark to cover six programming languages across three tasks. The dataset includes Python, JavaScript, Java, Go, C++, and Rust splits, with non-Python splits translated and cleaned by humans. It was created by the bigcode organization and updated on Hugging Face in August 2025.
Comidds is a survey of host-based and network-based intrusion detection datasets focused on enterprise networks, maintained by fkie-cad. Updated as of January 2026, it aggregates resources for cybersecurity research including netflow and event logs.
A synthetic, labeled dataset of 73,470 Indian SMS messages. It carries labels for three tasks: OTP detection, OTP intent classification into nine categories, and phishing detection. The dataset was created by gandharvbakshi and was last updated on December 5, 2025.
fkie-cad provides this collection of industrial control system datasets formatted for the Industrial Protocol Abstraction Layer (IPAL) to facilitate intrusion detection system evaluation. Updated in January 2026, the repository aggregates data from several prominent testbeds including SWaT, WADI, and HAI.
A text dataset focused on C++ reverse engineering concepts. The dataset was created by user Aleksandr12314254 and last updated in January 2026. Its specific size and content volume are not detailed.
A collection of source code samples from multiple programming languages, created by author mesolitica and last updated on June 1, 2025. The dataset was generated using the 'Magicoder: Source Code Is All You Need' template, targeting at least 10,000 rows per language. It includes samples for languages such as C++, C#, CUDA, and Dockerfile, sourced from the deduplicated version of The Stack dataset.
CTD sensor data was collected during the BROKE-West voyage of the Aurora Australis in 2006. Measurements include conductivity, temperature, and pressure, logged every second when the Rectangular Midwater Trawl net was in the water. The data was processed and provided by the Australian Antarctic Data Centre.
An aggregated dataset of coding and instruction-following examples designed to train agentic coding models. It was compiled by author 'ethanker' and last updated on November 30, 2025. The dataset likely contains samples from sources including CodeAlpaca-20k, Evol-CodeAlpaca-v1, Code Review Instruct, and APPS.
February 17th, 2008 marks Kosovo's unilateral declaration of independence from Serbia. This Annotation for Transparent Inquiry (ATI) data project examines Kosovo's unique constitution-making process, which involved an internationalized pouvoir constituant following the Kosovo War of 1998-1999 and the 2005-2006 Vienna Talks. The annotated article can be viewed on the publisher's website.