DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

ProductSearch Datasets Browse Topics Rankings Community API / MCP

ResourcesDocumentation Blog Changelog Status

LegalPrivacy Policy Terms of Service Cookie Policy

Software Engineering & Security Datasets | DataSalon

All Categories

🔒

Software Engineering & Security

Source code corpora, bug reports, vulnerability databases, network intrusion detection, malware samples

1,591 datasets

Phishing Email Classification Dataset for LLM Fine-Tuning

Comprising labeled emails for phishing detection, with each row classified as a safe email (label=0) or a phishing email (label=1). It includes metadata such as sender, receiver, date, and subject, along with a cleaned email body. The dataset is curated for fine-tuning large language models on this classification task.

ParquetSize Categories10 Kn100 KLibrarypolarsModalitytextLibrarymlcroissantLibrarydatasetsLibrarypandasRegionus+1

0 views

Software Engineering & Security

City of Tempe Cybersecurity Framework Scores by Quarter

The City of Tempe provides Cybersecurity Framework (CSF) scores for each CSF category per fiscal year quarter. The data is used to measure and report on the city's internal cybersecurity program maturity, based on the NIST framework's five functions: identify, protect, detect, respond, and recover.

Cybersecurity Pm 5 12Hacker Detection And PreventionCybersecurityPerformance MeasuresProtect AssetsFinancial Stability And VitalityCustomer PrivacyInfosecCyber SecurityInformation SecurityRegulatory ComplianceHarden Critical Infrastructure AssetsPrevent Data Breach+1

0 views

Software Engineering & Security

Monkey Species Image Collection for Fine-Grained Classification

Featuring nearly 1400 JPEG images of 10 monkey species, organized into training and validation splits. It was created by Lehrig as a test case for fine-grained classification tasks, with images sourced from Wikipedia using the googliser tool.

Regionus+1

0 views

Software Engineering & Security

Maryland Groundwater Observation Well Locations from 2001

MDNET provides point locations and names for a network of groundwater observation wells across the state of Maryland. The dataset was created for use within Geographic Information Systems by the organization CEOS_EXTRA and was last updated in 2001. It serves as a spatial index, with detailed water condition data available through a linked U.S. Geological Survey database.

GeospatialGeospatial PointsObservation WellsGroundwaterMaryland+1

0 views

Software Engineering & Security

HomeRun Graph and Hypergraph Null Model Replication Data

Replication data for the HomeRun algorithm, which performs curveball trades in streaming for fast null modeling of graphs, hypergraphs, and binary matrices. The dataset was authored by Matteo Riondato and last updated in January 2026. Specific details on the data volume and structure are unavailable.

Computer and Information Science+1

0 views

Software Engineering & Security

MH-100K: 101,975 Android Applications for Malware Detection Research (2010-2022)

101,975 Android application packages (APKs) collected between 2010 and 2022. The dataset provides high-dimensional tabular data from static analysis, including permissions and API calls, for studying malware evolution. It was created by author 'hendriow' and hosted on Hugging Face.

TabularMalwareLanguageenSecurityTabular ClassificationCybersecuritySize Categories100 Kn1 MAndroid MalwareLicensecc By 40Task CategoriesotherAndroidRegionusLarge ScaleStatic AnalysisInt8Task Categoriestabular Classification+1

0 views

Software Engineering & Security

USMCA Investment Commitments Gravity Analysis Data

US International Trade Commission data supports the gravity analysis for investment commitments under the U.S.-Mexico-Canada Trade Agreement. The dataset was used in USITC Publication 4889 to model the agreement's economic impact. It provides the empirical foundation for the investment analyses detailed in appendix J of the official report.

U S Mexico Canada Trade AgreementUsmca+1

0 views

Software Engineering & Security

Cybersecurity Vulnerability Records from ExploitDB

Featuring 70,233 structured records of cybersecurity vulnerabilities and exploits sourced from ExploitDB. It is processed for machine learning and security research applications, with data last updated in June 2025.

ParquetSize Categories10 Kn100 KTask Categoriestext GenerationLibrarypolarsTask Categoriesquestion AnsweringLanguageenSecurityCybersecurityModalitytextLibrarymlcroissantLibrarydatasetsLibrarypandasRegionusTask Categoriestext ClassificationLanguageruCveExploitLicensemitVulnerability+1

0 views

Software Engineering & Security

Long Island Coastal Orthophoto Mosaic Tiles from 2002

NOAA National Ocean Service provides 1,541 true color orthorectified image tiles covering Long Island, New York. The dataset contains 100 source images mosaicked into 1000m by 1000m GeoTIFF tiles with a 0.5-meter pixel resolution. Data was produced in June 2002 for a USGS benthic mapping contract.

ImageGeospatialCoastal ImageryBenthic MappingComputer VisionOrthophoto mosaicGeospatial Imagery+1

0 views

Software Engineering & Security

Antarctic Snow Transport and Atmospheric Turbulence Measurements Near Syowa Station

Measurement and simulation data characterize standard meteorology, turbulence, and snow transport at the S17 site near Syowa Station in East Antarctica during the austral summer of 2018/2019. An automatic station recorded data from January 10 to 26, 2019, equipped with sensors including a 3D ultrasonic anemometer and a snow particle counter. The dataset also includes large-eddy simulations of two 10-minute intervals and remote sensing data from a tilted Micro Rain Radar.

TabularTime SeriesGeospatialAntarctic MeteorologyLarge Eddy SimulationSnow TransportAtmospheric Turbulence+1

0 views

Software Engineering & Security

University Student Ethical Judgment Survey of 659 Madrid Students

659 university students in Madrid responded to a survey on dishonest behaviors like cheating and plagiarism. The database was created to study sensitivity towards self-committed and observed dishonest actions, evaluating ethical judgment on severity and blame. Multivariate statistical methods, including K-means cluster analysis, were used to classify individuals into profiles based on their judgment and tendency to commit dishonest acts.

TabularSurvey DataUniversity StudentsAcademic Ethics+1

0 views

Software Engineering & Security

MultiLang Code Parser Dataset: Parsed Source Code Across 10 Languages

MultiLang-Code-Parser-Dataset (MLCPD) provides a large-scale, unified dataset of parsed source code across 10 major programming languages. Each entry corresponds to a parsed source file and includes language metadata, code-level statistics, and a universal schema JSON representation. The dataset was created by jugalgajjar and last updated on October 23, 2025.

TabularSource CodeLarge ScaleAbstract Syntax TreeProgramming Languages+1

0 views

Software Engineering & Security

Australian Antarctic Heritage Funding Indicator, 2001

The Australian Antarctic Territory and Heard and McDonald Islands are covered by this indicator, which tracked the level of funding provided by the Australian Antarctic Division for heritage expertise. It was designed as an annual response indicator to measure governmental commitment to preserving cultural heritage in Antarctica. The dataset is considered obsolete and was last updated on December 31, 2001.

TabularGovernment FundingEnvironmental IndicatorAntarctic HeritageResponse Indicator+1

0 views

Software Engineering & Security

Defensive Cybersecurity Instruction Pairs for AI Training

2,500 instruction-response pairs provide detailed guidance on information security principles. This dataset is designed to train AI models for defensive cybersecurity education while refusing malicious assistance. The data is in English and formatted as Parquet files.

TextEnglishCybersecurityInformation SecurityDefensive Security+1

0 views

Software Engineering & Security

GitHub Issue and Pull Request Pairs from 12 Python Repositories

2,294 Issue-Pull Request pairs test automated resolution of real-world software problems. The dataset was created by the SWE-bench project to evaluate systems using unit test verification against post-PR behavior. It is sourced from 12 popular Python repositories.

ParquetSize Categories10 Kn100 KLibrarypolarsModalitytextLibrarymlcroissantLibrarydatasetsLibrarypandasRegionusArxiv231006770+1

0 views

Software Engineering & Security

Advanced SIEM Dataset: 100,000 Synthetic Security Event Records

A synthetic dataset of 100,000 security event records designed for training machine learning and artificial intelligence models in cybersecurity. It simulates logs from Security Information and Event Management (SIEM) systems, capturing diverse event types such as firewall activities, intrusion detection system alerts, authentication attempts, endpoint activities, network traffic, and cloud events. The dataset was created by author darkknight25 and last updated on July 11, -2025.

TabularJSONSIEMLibrarypolarsLanguageenCybersecurityModalitytextSize Categories100 Kn1 MModalitytabularLibrarymlcroissantSecurity EventsLibrarydatasetsLibrarypandasRegionusSynthetic DataLicensemitSynthetic+1

0 views

Software Engineering & Security

Phishing and Benign Email Collection for Detection Models

A curated collection of phishing and legitimate emails for cybersecurity applications. The dataset was created by darkknight25 and last updated on May 19, 2025. Each entry includes fields such as subject, body, intent, technique, target, and a classification label.

TextJSONLibrarypolarsLanguageenCybersecuritySize Categoriesn1 KModalitytextLibrarymlcroissantLibrarydatasetsLibrarypandasText ClassificationRegionusPhishing DetectionTask Categoriestext ClassificationLicensemit+1

0 views

Software Engineering & Security

Maryland Permitted Groundwater Withdrawal Sites from 1998

Point locations represent wells in Maryland permitted to withdraw 10,000 gallons or more of groundwater per day. The dataset includes permit details, withdrawal amounts, aquifer codes, and use types. Data was compiled by the Maryland Department of the Environment from the USGS SWUDS database for the year 1998.

GeospatialGeospatial PointsWater UseGroundwater+1

0 views

Software Engineering & Security

All CVE Records: 300,000 Multi-Turn Cybersecurity Conversations (1999-2025)

Approximately 300,000 Common Vulnerabilities and Exposures (CVE) records from 1999 to 2025 are formatted as multi-turn chat conversations in this dataset. Created by AlicanKiraz0, the collection transforms structured vulnerability data into a dialogue format specifically for training cybersecurity AI agents.

JSONTask Categoriestext GenerationLibrarypolarsLanguageenCybersecurityModalitytextSize Categories100 Kn1 MLibrarymlcroissantLibrarydatasetsLibrarypandasRegionusCveLicenseapache 20Vulnerability+1

0 views

Software Engineering & Security

Fenrir V2.0: 83,920 Defensive Cybersecurity Instruction Triples

Fenrir V2.0 contains 83,920 system/user/assistant triples for defensive cybersecurity instruction-tuning, created by Alican Kiraz and updated in October 2025. The collection focuses on alignment-safe training across frameworks like MITRE ATT&CK, NIST CSF, and the OWASP Top 10.

JSONSize Categories10 Kn100 KTask Categoriestext GenerationLibrarypolarsLanguageenCybersecurityModalitytextLibrarymlcroissantLibrarydatasetsLibrarypandasRegionusLicenseapache 20Defensive Security+1

0 views

PreviousPage 61 of 80Next