Source code corpora, bug reports, vulnerability databases, network intrusion detection, malware samples
1,591 datasets
BigCode released The Stack in late 2022, a 3TB collection of source code spanning 30+ programming languages and 193 permissive licenses. The dataset contains between 100 million and 1 billion near-deduplicated records of public code files scraped from the web.
596 raw PDF documents downloaded from the NIST Computer Security Resource Center (CSRC). The dataset serves as the source material for the NIST cybersecurity training dataset and the HackIDLE-NIST-Coder model. It was uploaded by ethanolivertroy and last updated on 2025-10-22.
GLOBEC project data collected by the R/V Nathaniel B. Palmer on the western Antarctic shelf between April and September 2001. The dataset includes measurements of temperature, salinity, oxygen, and photosynthetically active radiation (PAR) to understand ecosystem responses to physical forcing. It was contributed by NOAA's National Centers for Environmental Information (NCEI).
Always Further created this dataset of 10,050 labelled examples for secure code generation. It was generated using the DeepFabric open-source tool and is intended for training AI models. The dataset was last updated on October 14, 2025.
Open-MalSec is an open-source dataset curated for cybersecurity research and applications. It was created by tegridydev and last updated on March 25, 2025. The dataset encompasses labeled data from diverse cybersecurity domains, including phishing, malware analysis, and vulnerability disclosures.
SecVulEval is a collection of real-world C/C++ vulnerabilities curated by arag0rn from the National Vulnerability Database (NVD). The dataset features statement-level vulnerability information, context for vulnerable functions, and metadata such as CVE and CWE identifiers. It was last updated on October 10, 2025.
Africa Synth Telecom Cybersecurity Incident Logs Nigeria is a dataset of 30,000 synthetic security event records. It was created by electricsheepafrica and published on Hugging Face on October 5, 2025. The logs include incidents such as intrusions, DDoS attacks, and malware targeting telecom infrastructure.
100+ manually curated code samples covering languages such as Python, JavaScript, C++, and Java. The collection focuses on production-grade snippets for data structures, algorithms, and system utilities, and is provided in Excel and CSV formats.
Capa is a malware analysis tool that identifies a file's capabilities. This dataset, created by user joyce8, contains annotations of malware capabilities such as anti-VM strings and XOR encoding. It was last updated on Hugging Face in August 2025.
Service Level Agreements (SLAs) defining time commitments for City Agencies to respond to 311 service requests. The dataset is published by the City of New York's open data portal and was last updated in March 2024.
Hourly in-situ meteorological observations from 1953 to 2005 for Canada and from 1871 to 2001 for the former USSR. The dataset is a joint compilation by the Meteorological Service of Canada, Russia's Research Institute for Hydrometeorological Information, and NOAA's National Climatic Data Center. It includes data from up to 170 active Canadian stations and over 2,000 stations across the former Soviet Union.
300 test Issue-Pull Request pairs from 11 popular Python repositories, released as part of the SWE-bench research project. The dataset was created to test systems' ability to automatically resolve real-world GitHub issues. Evaluation is performed by running unit tests, with post-PR behavior serving as the reference solution. It was released by the SWE-bench organization and last updated on 2025-04-29.
An open-source collection of NIST cybersecurity training documents, including a CSWP series of 23 white papers. The dataset has undergone extensive link validation, with 124,946 links fixed and 72,698 broken links cataloged. It is designed for fine-tuning large language models on cybersecurity topics.
700 YARA rules (R001–R700) curated by darkknight25 for detecting ransomware variants and identifying benign software. The dataset balances 350 malicious rules targeting ransomware with an equal number of benign rules, supporting threat detection and machine learning model training. It was last updated on June 8, 2025.
CAD-Recode provides approximately 1 million training examples of CAD code paired with point clouds. The dataset is released by filapro and accompanies a model for reverse engineering CAD designs. The validation set contains about 1,000 examples, and the dataset was last updated on March 16, 2025.
Data on administrative services provided by the 'Center for the provision of administrative services' of the Chervonohryhorivka village council executive committee. The description mentions fields for request identifiers, dates, service information, subjects, permitting bodies, performance state, service provision date, and administrative fee amounts. The dataset was last updated on 2025-05-15 and is hosted on the States site of Ukraine via the eu_open_data platform.
3.2 million malicious and benign files across six formats (Win32, Win64, .NET, APK, ELF, PDF) were compiled by joyce8 and released in 2024. This dataset updates previous EMBER iterations to support malware analysis tasks including family classification and behavior prediction. It provides seven distinct label types for multi-faceted security research.
Trendyol's dataset contains approximately 300,000 Common Vulnerabilities and Exposures (CVE) records published between 1999 and 2025. Each record has been parsed, enriched, and converted into a conversational format. The dataset is hosted on Hugging Face and was last updated in June 2025.
mvasiliniuc's dataset contains 753,693 raw Swift code files extracted from GitHub, totaling approximately 700MB of data. It was created from the public GitHub dataset on Google BigQuery with the purpose of training code generation models. The dataset was last updated on June 16, 2023.
United States precipitation and temperature records, corrected for station changes. The data includes raw, time-of-observation-adjusted, and homogenized sets for maximum, minimum, and average temperature and precipitation. This superseded version was produced by NOAA NCEI and last updated in January 2013.