Source code corpora, bug reports, vulnerability databases, network intrusion detection, malware samples
1,591 datasets
Labeled emails for phishing detection, with each row classified as a safe email (label=0) or a phishing email (label=1). The dataset includes metadata such as sender, receiver, date, and subject, along with a cleaned email body. It is curated for fine-tuning large language models on this classification task.
The City of Tempe provides Cybersecurity Framework (CSF) scores for each CSF category per fiscal year quarter. The data is used to measure and report on the city's internal cybersecurity program maturity, based on the NIST framework's five functions: identify, protect, detect, respond, and recover.
Nearly 1,400 JPEG images of 10 monkey species, organized into training and validation splits. The dataset was created by Lehrig as a test case for fine-grained classification tasks, with images sourced from Wikipedia using the googliser tool.
MDNET provides point locations and names for a network of groundwater observation wells across the state of Maryland. The dataset was created for use within Geographic Information Systems by the organization CEOS_EXTRA and was last updated in 2001. It serves as a spatial index, with detailed water condition data available through a linked U.S. Geological Survey database.
Replication data for the HomeRun algorithm, which performs curveball trades in streaming for fast null modeling of graphs, hypergraphs, and binary matrices. The dataset was authored by Matteo Riondato and last updated in January 2026. Specific details on the data volume and structure are unavailable.
101,975 Android application packages (APKs) collected between 2010 and 2022. The dataset provides high-dimensional tabular data from static analysis, including permissions and API calls, for studying malware evolution. It was created by author 'hendriow' and hosted on Hugging Face.
US International Trade Commission data supports the gravity analysis for investment commitments under the U.S.-Mexico-Canada Trade Agreement. The dataset was used in USITC Publication 4889 to model the agreement's economic impact. It provides the empirical foundation for the investment analyses detailed in appendix J of the official report.
70,233 structured records of cybersecurity vulnerabilities and exploits sourced from ExploitDB. The data is processed for machine learning and security research applications and was last updated in June 2025.
NOAA National Ocean Service provides 1,541 true color orthorectified image tiles covering Long Island, New York. The dataset contains 100 source images mosaicked into 1000m by 1000m GeoTIFF tiles with a 0.5-meter pixel resolution. Data was produced in June 2002 for a USGS benthic mapping contract.
Measurement and simulation data characterize standard meteorology, turbulence, and snow transport at the S17 site near Syowa Station in East Antarctica during the austral summer of 2018/2019. An automatic station recorded data from January 10 to 26, 2019, equipped with sensors including a 3D ultrasonic anemometer and a snow particle counter. The dataset also includes large-eddy simulations of two 10-minute intervals and remote sensing data from a tilted Micro Rain Radar.
659 university students in Madrid responded to a survey on dishonest behaviors like cheating and plagiarism. The database was created to study sensitivity towards self-committed and observed dishonest actions, evaluating ethical judgment on severity and blame. Multivariate statistical methods, including K-means cluster analysis, were used to classify individuals into profiles based on their judgment and tendency to commit dishonest acts.
MultiLang-Code-Parser-Dataset (MLCPD) provides a large-scale, unified dataset of parsed source code across 10 major programming languages. Each entry corresponds to a parsed source file and includes language metadata, code-level statistics, and a universal schema JSON representation. The dataset was created by jugalgajjar and last updated on October 23, 2025.
The Australian Antarctic Territory and Heard and McDonald Islands are covered by this indicator, which tracked the level of funding provided by the Australian Antarctic Division for heritage expertise. It was designed as an annual response indicator to measure governmental commitment to preserving cultural heritage in Antarctica. The dataset is considered obsolete and was last updated on December 31, 2001.
2,500 instruction-response pairs provide detailed guidance on information security principles. This dataset is designed to train AI models for defensive cybersecurity education while refusing malicious assistance. The data is in English and formatted as Parquet files.
2,294 Issue-Pull Request pairs test automated resolution of real-world software problems. The dataset was created by the SWE-bench project to evaluate systems using unit test verification against post-PR behavior. It is sourced from 12 popular Python repositories.
A synthetic dataset of 100,000 security event records designed for training machine learning and artificial intelligence models in cybersecurity. It simulates logs from Security Information and Event Management (SIEM) systems, capturing diverse event types such as firewall activities, intrusion detection system alerts, authentication attempts, endpoint activities, network traffic, and cloud events. The dataset was created by author darkknight25 and last updated on July 11, 2025.
A curated collection of phishing and legitimate emails for cybersecurity applications. The dataset was created by darkknight25 and last updated on May 19, 2025. Each entry includes fields such as subject, body, intent, technique, target, and a classification label.
Point locations represent wells in Maryland permitted to withdraw 10,000 gallons or more of groundwater per day. The dataset includes permit details, withdrawal amounts, aquifer codes, and use types. Data was compiled by the Maryland Department of the Environment from the USGS SWUDS database for the year 1998.
Approximately 300,000 Common Vulnerabilities and Exposures (CVE) records from 1999 to 2025 are formatted as multi-turn chat conversations in this dataset. Created by AlicanKiraz0, the collection transforms structured vulnerability data into a dialogue format specifically for training cybersecurity AI agents.
Fenrir V2.0 contains 83,920 system/user/assistant triples for defensive cybersecurity instruction-tuning, created by Alican Kiraz and updated in October 2025. The collection focuses on alignment-safe training across frameworks like MITRE ATT&CK, NIST CSF, and the OWASP Top 10.