DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Software Engineering & Security Datasets | DataSalon

Software Engineering & Security

Source code corpora, bug reports, vulnerability databases, network intrusion detection, malware samples

1,591 datasets

OIBench: 250 Olympiad-Level Informatics Problems with Solutions

OIBench contains 250 original, high-quality, and challenging olympiad-level informatics problems curated by AGI-Eval. The dataset includes problem statements, solutions, test cases, pseudocode, and difficulty levels, processed into Parquet format for efficient analysis. It was last updated on Hugging Face in July 2025.

Tags: Text, Parquet, Library: polars, Library: dask, Computer Science, Size Categories: n<1K, Modality: text, Library: mlcroissant, License: CC BY-ND 4.0, Library: datasets, Benchmark, Algorithmic Problems, Code Generation, Region: US, arXiv: 2506.10481, Test Cases, Informatics Benchmark

Diff-XYZ: A Benchmark of 1,000 Real-World Code Edits

Diff-XYZ is a dataset for the paper 'Diff-XYZ: A Benchmark for Evaluating Diff Understanding'. It contains 1,000 real-world code edits sampled and filtered from the CommitPackFT dataset. The dataset was created by JetBrains-Research and was last updated on November 14, 2025.

Tags: Text, LLM Benchmark, Code Diff, Software Engineering, Benchmark, Code Editing

SCV-1-2000: 2,000 Advanced DeFi Smart Contract Vulnerabilities

This dataset, released by darkknight25 in June 2025, comprises 2,000 JSONL records of advanced DeFi smart contract vulnerabilities. It focuses on unconventional attack vectors in Decentralized Finance protocols rather than standard reentrancy or overflow issues.

Tags: Size Categories: 1K<n<10K, Language: en, Cybersecurity, Smartcontract, Region: US, Blockchain, Task Categories: text-classification, License: MIT

Cybersecurity Instruction-Response Pairs for LLM Fine-Tuning

Cybersecurity QA is a dataset of instruction-response pairs focused on cybersecurity concepts, created by mariiazhiv. The dataset was last updated on September 19, 2025. It is structured in JSONL format with fields for instruction, input, and output.

Tags: Text, Cybersecurity, Question Answering, LLM Fine-Tuning
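
The JSONL layout described for this dataset (one JSON object per line with instruction, input, and output fields) can be read with the standard library alone. The following is a minimal sketch, not the dataset author's tooling: the sample record is invented for illustration, and the helper names `parse_records` and `to_prompt` are hypothetical.

```python
import json

# Field names come from the dataset description; the sample values below
# are invented for illustration and are NOT taken from the dataset.
REQUIRED_FIELDS = ("instruction", "input", "output")
SAMPLE_JSONL = (
    '{"instruction": "Define phishing.", "input": "", '
    '"output": "Phishing is a social engineering attack."}\n'
)


def parse_records(text):
    """Parse JSONL text into a list of dicts, checking the expected fields."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        records.append(record)
    return records


def to_prompt(record):
    """Join instruction and optional input into one prompt (an assumed convention)."""
    if record["input"]:
        return f'{record["instruction"]}\n\n{record["input"]}'
    return record["instruction"]


if __name__ == "__main__":
    for rec in parse_records(SAMPLE_JSONL):
        print(to_prompt(rec), "->", rec["output"])
```

Validating the three fields up front makes a malformed line fail with its line number rather than surfacing later as a KeyError during fine-tuning preprocessing.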

CoreCodeBench Single: Single Test Cases for Code Generation

CoreCodeBench Single provides test cases for evaluating code generation models. The dataset was created by meituan-longcat and was last updated on May 15, 2025. It includes verified and English-only versions of the test cases.

Tags: Text, Benchmark, Code Generation, Software Testing, Programming

Pokrovsky City Council Budget Program Reports from Dnipropetrovsk Region

This dataset contains reports on the implementation of budget program passports for the Executive Committee of the Pokrovsky City Council in Ukraine's Dnipropetrovsk region. It is structured as annual archives, with the latest update recorded on 2025-05-13. It originates from the States site of Ukraine, an open data platform.

Tags: Tabular, ZIP, Government Transparency, Ukraine Local Government, Budget Reports, Public Finance

Delegated Powers Implementation Records for Starokostiantynivka City Council

This dataset provides information on the implementation of delegated powers by the Executive Committee of Starokostiantynivka City Council in accordance with Ukrainian law. It originates from the States site of Ukraine and was last updated on June 20, 2025. It likely contains documents detailing the execution of powers delegated under the Law of Ukraine dated May 21, 1997.

Tags: Text, Ukraine, Legal Implementation, Local Government, Delegated Powers

OpenCodeReasoning-2: Python and C++ Samples for Code Critique

OpenCodeReasoning-2 contains 1.4 million Python and 1.1 million C++ samples derived from 34,799 unique competitive programming questions. This synthetic dataset is designed for supervised fine-tuning tasks in code completion and critique. It was created by NVIDIA and released on the Hugging Face platform in May 2025.

Tags: Text, Parquet, Task Categories: text-generation, Competitive Programming, Library: polars, Library: dask, Size Categories: 1M<n<10M, Modality: text, Library: mlcroissant, Library: datasets, License: CC BY 4.0, Code Generation, Region: US, Reasoning, Large Scale, Synthetic Data, Synthetic

OIBench: 250 Olympiad-Level Informatics Problems

OIBench is a high-quality, private, and challenging benchmark consisting of 250 carefully curated original problems. The dataset contains algorithm problem statements, solutions, and associated metadata such as test cases, pseudocode, and difficulty levels. It was created by meituan-longcat and last updated on July 15, 2025.

Tags: Text, Competitive Programming, Algorithm Problems, Benchmark, Informatics Benchmark

Kenya Constitution-Making Processes Annotation Data, 2000-2010

Christina Murray's Annotation for Transparent Inquiry (ATI) data project annotates a chapter analyzing Kenya's two constitution-making processes. The first process ran from 2000 to 2005 and ended when the draft was rejected in a referendum; the second ran from 2008 to 2010, culminating in the adoption of a new constitution in August 2010. The chapter reflects on the design differences between the processes and the role of a foreign member of the drafting Committee of Experts.

Tags: Text, Kenya, Political Process, Constitutional Law, Annotation Transparent Inquiry

Cybersecurity Threat Identification Instruction Dataset

This dataset contains 32,000 instruction-following examples for training cybersecurity risk models. It was curated by Vanessa Lopes and includes public reports and news, with outputs generated by GPT. It was last updated in April 2024.

Tags: Text, Parquet, Size Categories: 10K<n<100K, Library: polars, Cybersecurity, Modality: text, Risk assessment, Modality: tabular, Library: mlcroissant, Library: datasets, Library: pandas, Text Classification, Region: US, Natural Language Processing, Instruction Tuning

Common Vulnerabilities And Exposures Database 1999-2025

This dataset covers every Common Vulnerabilities and Exposures (CVE) entry published in the National Vulnerability Database from 1999 to 2025. It was compiled by stasvinokur using a Python script that calls the NVD REST API, with records available through May 30, 2025.

Tags: Tabular, CWE, Cybersecurity, Software Security, CVE, Vulnerability Database

CVE And CWE Vulnerability Records From 1999 To 2025

Every Common Vulnerabilities and Exposures entry published in the National Vulnerability Database from CVE-1999-0001 through May 30, 2025 is included. The dataset was compiled automatically by user stasvinokur using a Python script calling the NVD REST API v2.0. It was last updated on June 3, 2025.

Tags: Tabular, CWE, Cybersecurity, Software Security, CVE, Vulnerability Database
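
The description mentions a Python script calling the NVD REST API v2.0. As a rough sketch of how paginated retrieval from that API might be organized (this is not the compiler's actual script), the helper below only builds the request URLs and makes no network call. The helper name `page_urls` and the page size of 2,000 are assumptions; `startIndex` and `resultsPerPage` are the API's pagination parameters.

```python
from urllib.parse import urlencode

# Public CVE endpoint of the NVD API 2.0.
NVD_CVE_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"
PAGE_SIZE = 2000  # assumed page size for this sketch


def page_urls(total_results, page_size=PAGE_SIZE):
    """Return one request URL per page needed to cover total_results records."""
    urls = []
    for start in range(0, total_results, page_size):
        query = urlencode({"resultsPerPage": page_size, "startIndex": start})
        urls.append(f"{NVD_CVE_URL}?{query}")
    return urls


if __name__ == "__main__":
    # Hypothetical total; a real script would read it from the first response.
    for url in page_urls(4500):
        print(url)
```

A real script would typically fetch the first page, read the reported total from the response, and then request the remaining offsets with a delay between calls to respect rate limits.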

Phishing Email and URL Detection Dataset with 200k Samples

200,000 entries combine email messages and URLs for multi-class phishing detection. The dataset, created by cybersectony, contains 22,644 email samples and 177,356 URL samples, each with a content and label field. It was last updated on Hugging Face in October 2024.

Tags: Text, Tabular, Parquet, URL Classification, Library: polars, Cybersecurity, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Library: datasets, Library: pandas, Region: US, Phishing Detection

CudaLLM Data: PyTorch Operator Test Cases for CUDA Kernel Generation

ByteDance-Seed provides a dataset of PyTorch operator test cases designed to benchmark LLMs in generating optimized CUDA kernels. It contains pairs of standard PyTorch nn.Module implementations and their performance-optimized versions using custom CUDA kernels. The dataset was last updated on August 3, 2025.

Tags: Text, AI for HPC, CUDA Kernels, PyTorch, Benchmark, Code Generation, Compiler Optimization, Synthetic

Cybersecurity Wikipedia Articles Consolidated into Parquet Format

Cybersecurity Wiki Slices is a curated collection of English Wikipedia pages covering cybersecurity topics. The dataset contains approximately 24.73 million tokens and was created by tandevllc, with a last recorded update on 2025-11-05. Content originates from Wikipedia and is consolidated into Parquet for fast streaming.

Tags: Text, Cybersecurity, Wikipedia, Information Security, Text Corpus

Evaded Prompt Injection and Jailbreak Samples: 10K+ Adversarial Pairs

Between 10,000 and 100,000 text samples for testing LLM security layers were released by Mindgard in April 2025. The data documents prompt injections and jailbreaks modified via character injection and adversarial machine learning evasion techniques.

Tags: Parquet, Size Categories: 10K<n<100K, Task Categories: text-generation, Library: polars, Language: en, Modality: text, Library: mlcroissant, Library: datasets, Library: pandas, arXiv: 2504.11168, License: CC BY-NC 4.0, Region: US, Task Categories: text-classification

NADW Network Attacks Dataset: Network Traffic for Anomaly Detection

This network traffic dataset was collected using Wireshark for training AI models in cybersecurity. It includes both normal traffic and various types of simulated network attacks covering a range of common cybersecurity threats. The dataset was created by onurkya7 and last updated on 2025-01-21.

Tags: Tabular, Cybersecurity, Network Traffic, Anomaly Detection, Wireshark, Synthetic

Moderngov Contract: London Borough of Barnet CMS Procurement 2022-2025

The London Borough of Barnet contracted Civica UK Ltd for a committee papers content management system. The contract commenced on 23rd Feb 2022 and runs until 22nd Feb 2025, with personal data redacted. This record was published by the Government Digital Service on 2022-03-28.

Tags: Tabular, 🇬🇧 United Kingdom, Government Contracts, Software Procurement, Local Government

Solidity Smart Contracts and Test Cases from Public GitHub Repositories

This dataset contains 355,540 rows collected from public GitHub repositories of code written in the Solidity programming language. It includes information about smart contracts and associated test cases, including unit tests. It was authored by seyyedaliayati and last updated on June 23, 2023.

Tags: Tabular, Smart Contracts, Source Code, Solidity, Software Testing, Blockchain
