Source code corpora, bug reports, vulnerability databases, network intrusion detection, malware samples
1,591 datasets
250 original, high-quality, and challenging olympiad-level informatics problems curated by AGI-Eval. The dataset includes problem statements, solutions, test cases, pseudo code, and difficulty levels, processed into Parquet format for efficient analysis. It was last updated on the HuggingFace platform in July 2025.
Diff-XYZ is a dataset for the paper 'Diff-XYZ: A Benchmark for Evaluating Diff Understanding'. It contains 1,000 real-world code edits sampled and filtered from the CommitPackFT dataset. The dataset was created by JetBrains-Research and was last updated on November 14, 2025.
This dataset comprises 2,000 JSONL records of advanced DeFi smart contract vulnerabilities and was released by darkknight25 in June 2025. It targets Decentralized Finance protocols specifically, focusing on unconventional attack vectors rather than standard reentrancy or overflow issues.
Cybersecurity QA is a dataset of instruction-response pairs focused on cybersecurity concepts, created by mariiazhiv. The dataset was last updated on September 19, 2025. It is structured in JSONL format with fields for instruction, input, and output.
CoreCodeBench Single provides test cases for evaluating code generation models. The dataset was created by author 'meituan-longcat' and was last updated on May 15, 2025. It includes verified and English-only versions of the test cases.
Reports on the implementation of budget program passports for the Executive Committee of the Pokrovsky City Council in Ukraine's Dnipropetrovsk region. The dataset is structured as annual archives, with the latest update recorded on 2025-05-13. It originates from the States site of Ukraine, an open data platform.
Information on the implementation of delegated powers by the Executive Committee of Starokostiantynivka City Council in accordance with Ukrainian law. The dataset originates from the States site of Ukraine and was last updated on June 20, 2025. It likely contains documents detailing the execution of powers delegated under the Law of Ukraine dated May 21, 1997.
OpenCodeReasoning-2 contains 1.4 million Python and 1.1 million C++ samples derived from 34,799 unique competitive programming questions. This synthetic dataset is designed for supervised fine-tuning tasks in code completion and critique. It was created by NVIDIA and released on the Hugging Face platform in May 2025.
OIBench is a high-quality, private, and challenging benchmark consisting of 250 carefully curated original problems. The dataset contains algorithm problem statements, solutions, and associated metadata such as test cases, pseudo code, and difficulty levels. It was created by meituan-longcat and last updated on July 15, 2025.
Christina Murray's Annotation for Transparent Inquiry (ATI) data project annotates a chapter analyzing Kenya's two constitution-making processes. The first process ran from 2000 to 2005 and ended when the draft was rejected in a referendum; the second ran from 2008 to 2010, culminating in the adoption of a new constitution in August 2010. The chapter reflects on the design differences between the processes and the role of a foreign member of the drafting Committee of Experts.
32,000 instruction-following examples for training cybersecurity risk models. The dataset was curated by Vanessa Lopes and includes public reports and news, with outputs generated by GPT. It was last updated in April 2024.
Covering 1999 to 2025, this dataset includes every Common Vulnerabilities and Exposures (CVE) entry published in the National Vulnerability Database. It was compiled by stasvinokur using a Python script that calls the NVD REST API, with records available through May 30, 2025.
Every Common Vulnerabilities and Exposures entry published in the National Vulnerability Database from CVE-1999-0001 through May 30, 2025 is included. The dataset was compiled automatically by user stasvinokur using a Python script calling the NVD REST API v2.0. It was last updated on June 3, 2025.
200,000 entries combine email messages and URLs for multi-class phishing detection. The dataset, created by cybersectony, contains 22,644 email samples and 177,356 URL samples, each with a content and label field. It was last updated on Hugging Face in October 2024.
ByteDance-Seed provides a dataset of PyTorch operator test cases designed to benchmark LLMs in generating optimized CUDA kernels. It contains pairs of standard PyTorch nn.Module implementations and their performance-optimized versions using custom CUDA kernels. The dataset was last updated on August 3, 2025.
Cybersecurity Wiki Slices is a curated collection of English Wikipedia pages covering cybersecurity topics. The dataset contains approximately 24.73 million tokens and was created by tandevllc, with a last recorded update on 2025-11-05. Content originates from Wikipedia and is consolidated into Parquet for fast streaming.
Between 10,000 and 100,000 text samples for testing LLM security layers were released by Mindgard in April 2025. The data documents prompt injections and jailbreaks modified via character injection and adversarial machine learning evasion techniques.
A network traffic dataset collected using Wireshark for training AI models in cybersecurity. It includes both normal traffic and various types of simulated network attacks covering a range of common cybersecurity threats. The dataset was created by onurkya7 and last updated on 2025-01-21.
London Borough of Barnet contracted Civica UK Ltd for a committee papers content management system. The contract commenced on 23rd Feb 2022 and will run until 22nd Feb 2025, with personal data redacted. This record was published by the Government Digital Service on 2022-03-28.
355,540 rows of data collected from public GitHub repositories written in the Solidity programming language. The dataset includes information about smart contracts and associated test cases, including unit tests. It was authored by seyyedaliayati and last updated on June 23, 2023.