DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


Software Engineering & Security Datasets | DataSalon


Software Engineering & Security

Source code corpora, bug reports, vulnerability databases, network intrusion detection, malware samples

1,591 datasets

AI-Authored Commit History from Public GitHub Projects

AgentPack contains 1.3 million GitHub commits identified as co-authored by AI coding agents like Claude Code, OpenAI Codex, and Cursor Agent. The dataset was created by the nuprl research organization and captures commits from public projects between April and mid-August 2025. It provides a large-scale view of how AI agents contribute to real-world software development.

Tags: Text, Tabular, JSON, Library: dask, Size Categories: 1M to 10M, LLM Agents, Modality: text, Library: mlcroissant, Software Engineering, Library: datasets, GitHub Commits, AI Code Generation, Region: us, arXiv: 2509.21891, License: apache-2.0, +1 more


Zoning for Herd Protection Against Predation in Aveyron, France

Municipal zoning data for the Aveyron department in France, defining areas eligible for measures to protect domestic herds from wolf predation. The dataset is updated annually by prefectural decree and was last updated on 2019-04-19. It is provided by the Bureau de Recherches Géologiques et Minières (BRGM) via the EU Open Data platform.

Tags: Geospatial, Agricultural Zoning, Wolf Protection, Predation Risk, Rural Development, +1 more


Malicious and Safe URLs: Labeled Database for Machine Learning

This database of malicious and safe URLs was created by SaibaDev and last updated in January 2026. It provides a collection of web addresses categorized for security-related machine learning tasks.


SOREL-20M Subset: 196,534 Malware Samples with EMBER v2 Features

This SOREL-20M subset contains 196,534 samples for malware detection and was created by reveng-grp-2025. It includes 99,506 malicious and 97,028 benign samples, each described by 2,351 EMBER v2 features. The dataset was last updated on Hugging Face on 2025-06-02.

Tags: Tabular, Machine Learning, Cybersecurity, Malware Detection, EMBER Features, +1 more


Poltava City Council Cultural Department Contracts for 2018

Lists of agreements concluded by the Department of Culture of the Executive Committee of the Poltava City Council as of 31.01.2019, 01.03.2019, 01.04.2019, and 01.05.2019. The dataset was last updated on 2019-12-23 and originates from the States site of Ukraine, aggregated via the eu_open_data platform.

Tags: Tabular, Financial Transparency, Ukraine, Local Government, Public Contracts, +1 more


Nigerian Telecom Cybersecurity Incident Logs with 30,000 Events

Nigerian cybersecurity incident logs containing 30,000 security events on telecom infrastructure. The dataset includes 14 columns and was generated on 2025-10-05 by electricsheepafrica.

Tags: Tabular, Telecommunications, Cybersecurity, Incident Logs, Nigeria, Synthetic, +1 more


Commit Message Edits by Experts for AI-Generated Text Evaluation

JetBrains-Research provides a dataset of expert-labeled edits to AI-generated commit messages. The data was created by presenting labelers with GPT-4-generated messages for 15 commits from the CMG benchmark and asking them to edit the messages manually to a quality suitable for version control systems. The dataset was last updated on 2024-10-17.

Tags: Text, Commit Messages, AI Evaluation, Software Engineering, Benchmark, Code Generation, Synthetic, +1 more


CVEfixes: 12,987 Vulnerability Fix Records with Code Diffs

12,987 vulnerability fix records from CVEfixes v1.0.8, covering 11,726 unique CVEs across 4,205 software repositories. The dataset includes CVE metadata such as descriptions, CVSS scores, and CWE classifications, alongside git commit data and code diffs. It was uploaded by author hitoshura25 to Hugging Face and last updated on October 14, 2025.

Tags: Tabular, Software Security, Security Vulnerabilities, Code Diffs, CVE, +1 more


Humboldt River Main Stem Geometry from 1994 Orthophotos

River geometry data for the main stem of the Humboldt River in Nevada, as defined by U.S. Geological Survey personnel. The dataset was digitized on-screen using 1994 digital orthophoto quadrangles (DOQs) and underwent a rigorous multi-level quality-control process. It was created by the U.S. Geological Survey Nevada District in 2001 for GIS-based river mile calculations.

Tags: Geospatial, Hydrology, Geospatial Vector, Nevada, River Geometry, +1 more


SENTINEL-2 Leaf Area Index (LAI) Tiles for Vegetation Analysis

SENTINEL-2 satellite-derived Leaf Area Index (LAI) tiles, produced by FEDEO and hosted on NASA EarthData. LAI measures half the developed area of the convex hull wrapping green canopy elements per unit ground, including contributions from understory layers. The dataset was last updated on December 31, 2021.

Tags: Geospatial, Forest Canopy, Satellite Imagery, Leaf Area Index, Vegetation, +1 more


PROSUR Workshop on Human Dimensions of Floods in Mercosur

A 2001 workshop in Buenos Aires convened over fifty participants from Argentina, Brazil, Paraguay, Uruguay, and the USA. Organized by scientists Silvina Solman and Matilde Rusticucci under the PROSUR network, it focused on socio-economic vulnerability to climate variability. The metadata describes the event's context and includes a list of participants and related documents.

Tags: Text, Climate Change, Healthcare, Workshop Proceedings, Human Dimensions, Mercosur Region, Interdisciplinary Research, +1 more


Regulatory Acts of the Kamenetz-Podolsk City Council Executive Committee

A list of current regulatory acts from the Executive Committee of the Kamenetz-Podolsk City Council in Ukraine, last updated on 2025-04-15. The dataset includes information on the acts themselves and details on basic, repeated, and periodic tracking. It is provided by the States site of Ukraine in an Excel XLSX format.

Tags: Tabular, Legal Documents, Government Regulations, Policy Tracking, Ukraine, Local Government, +1 more


Godot GDScript Code from 5,000+ GitHub Repositories

Over 5,000 GitHub repositories provide the source for this GDScript code dataset. It was created by wallstoneai in June 2025, with each repository's code and README text consolidated into a single file. The dataset was last updated on the Hugging Face platform in August 2025.

Tags: Text, Task Categories: text-generation, GDScript, Language: en, Modality: text, Size Categories: 100K to 1M, Library: mlcroissant, Library: datasets, Godot, Code Generation, Region: us, Software Development, License: apache-2.0, Text Corpus, +1 more


National Vulnerability Database of Software Flaws and Security Checklists

The National Vulnerability Database (NVD) is the U.S. Government repository of security automation data. It provides a standards-based foundation for automating vulnerability management and compliance, supporting efforts based on the Security Content Automation Protocol (SCAP). The database includes listings of publicly known software flaws, security configuration checklists, product names, and impact metrics.

Tags: CVSS, 800-53, Checklists, NVD, CVE, SCAP, Vulnerability, +1 more


SecRepoBench: 318 Secure Code Completion Tasks from 27 C/C++ Repositories

318 code completion tasks obtained from 27 popular GitHub C/C++ repositories covering 15 Common Weakness Enumerations (CWEs). The benchmark, created by ai-sec-lab and last updated in November 2025, is built upon the ARVO dataset and is designed to evaluate large language models and agent frameworks for secure code generation.

Tags: Text, Software Security, Benchmark, Secure Code, C/C++, +1 more


CoreCodeBench-Multi: Multi-Testcase Code Generation Benchmark

CoreCodeBench-Multi is a dataset for benchmarking code generation models, created by meituan-longcat and last updated on May 15, 2025. It contains multi-test cases for evaluating function completion tasks. The dataset includes a standard version and a more difficult version.

Tags: Text, Benchmarking, Code Generation, Software Testing, +1 more


The Stack GitHub Issues: 30.9 Million Conversations from Issues and Pull Requests

This dataset collects conversations from GitHub issues and pull requests: 30.9 million files totaling 54 GB. Each conversation includes events such as opening an issue, creating a comment, or closing the issue, along with the author username, text, action, and identifiers. The dataset was created by bigcode and last updated in March 2023.

Tags: Text, GitHub, Conversation Data, Software Development, Issue Tracking, +1 more


Cybersecurity Detection Rules with AI-Generated Questions

A collection of 950 detection rules sourced from official SIGMA, YARA, and Suricata repositories. Knowledge distillation using the 0dAI-7.5B model was applied to generate questions and enrich responses for each rule. The dataset was created by jcordon5 and last updated on May 18, 2024.

Tags: Text, Detection Rules, Cybersecurity, Sigma, Suricata, YARA, +1 more


The Stack V2 Dedup: Near-Deduplicated Source Code from 600+ Languages

The Stack V2 Dedup is a near-deduplicated collection of source code containing between 1 billion and 10 billion records across 600+ programming languages. Produced by BigCode and last updated in April 2024, it serves as a refined subset of the full Stack v2 dataset for training large language models.

Tags: Parquet, Language: code, Task Categories: text-generation, Language Creators: expert-generated, License: other, Library: polars, Language Creators: crowdsourced, arXiv: 2402.19173, Library: dask, Modality: text, Size Categories: 1B to 10B, Modality: tabular, Library: mlcroissant, arXiv: 2207.14157, Library: datasets, Multilinguality: multilingual, Region: us, arXiv: 2107.03374, +1 more


Lubny City Council Budget Documents for 2019

2019 documents related to the local budget of the Lubny Territorial Community in Ukraine. The dataset is hosted on the eu_open_data platform and includes files in .XLSX and .ZIP formats. The dataset metadata was last updated on June 20, 2025.

Tags: Tabular, ZIP, Excel, EU Open Data, Budget, Ukraine, Local Government, Public Finance, +1 more
