Source code corpora, bug reports, vulnerability databases, network intrusion detection, malware samples
1,591 datasets
AgentPack contains 1.3 million GitHub commits identified as co-authored by AI coding agents like Claude Code, OpenAI Codex, and Cursor Agent. The dataset was created by the nuprl research organization and captures commits from public projects between April and mid-August 2025. It provides a large-scale view of how AI agents contribute to real-world software development.
Municipal zoning data for the Aveyron department in France, defining areas eligible for measures to protect domestic herds from wolf predation. The dataset is updated annually by prefectural decree and was last updated on 2019-04-19. It is provided by the Bureau de Recherches Géologiques et Minières (BRGM) via the EU Open Data platform.
This database of malicious and safe URLs was created by SaibaDev and last updated in January 2026. It provides a collection of web addresses categorized for security-related machine learning tasks.
SOREL-20M Subset Dataset contains 196,534 samples for malware detection, created by reveng-grp-2025. It includes 99,506 malicious and 97,028 benign samples, each described by 2,351 EMBER v2 features. The dataset was last updated on Hugging Face on 2025-06-02.
Lists of agreements concluded by the Department of Culture of the Executive Committee of the Poltava City Council as of 31.01.2019, 01.03.2019, 01.04.2019, and 01.05.2019. The dataset was last updated on 2019-12-23 and originates from the States site of Ukraine, aggregated via the eu_open_data platform.
Nigerian cybersecurity incident logs containing 30,000 security events on telecom infrastructure. The dataset includes 14 columns and was generated on 2025-10-05 by electricsheepafrica.
JetBrains-Research provides a dataset of expert-labeled edits to AI-generated commit messages. The data was created by presenting labelers with GPT-4-generated messages for 15 commits from the CMG benchmark and asking them to manually edit each message until it was of a quality suitable for version control systems. The dataset was last updated on 2024-10-17.
12,987 vulnerability fix records from CVEfixes v1.0.8, covering 11,726 unique CVEs across 4,205 software repositories. The dataset includes CVE metadata such as descriptions, CVSS scores, and CWE classifications, alongside git commit data and code diffs. It was uploaded by author hitoshura25 to Hugging Face and last updated on October 14, 2025.
River geometry data for the main stem of the Humboldt River in Nevada, as defined by U.S. Geological Survey personnel. The dataset was digitized on-screen using 1994 digital orthophoto quadrangles (DOQs) and underwent a rigorous multi-level quality-control process. It was created by the U.S. Geological Survey Nevada District in 2001 for GIS-based river mile calculations.
SENTINEL-2 satellite-derived Leaf Area Index (LAI) tiles, produced by FEDEO and hosted on NASA EarthData. LAI measures half the developed area of the convex hull wrapping green canopy elements per unit ground, including contributions from understory layers. The dataset was last updated on December 31, 2021.
A 2001 workshop in Buenos Aires convened over fifty participants from Argentina, Brazil, Paraguay, Uruguay, and the USA. Organized by scientists Silvina Solman and Matilde Rusticucci under the PROSUR network, it focused on socio-economic vulnerability to climate variability. The metadata describes the event's context and includes a list of participants and related documents.
A list of current regulatory acts from the Executive Committee of the Kamenetz-Podolsk City Council in Ukraine, last updated on 2025-04-15. The dataset includes information on the acts themselves and details on basic, repeated, and periodic tracking. It is provided by the States site of Ukraine in Excel (XLSX) format.
Over 5,000 GitHub repositories provide the source for this GDScript code dataset. It was created by wallstoneai in June 2025, with each repository's code and README text consolidated into a single file. The dataset was last updated on the Hugging Face platform in August 2025.
The National Vulnerability Database (NVD) is the U.S. Government repository of security automation data. It provides a standards-based foundation for automating vulnerability management and compliance, supporting efforts based on the Security Content Automation Protocols (SCAP). The database includes listings of publicly known software flaws, security configuration checklists, product names, and impact metrics.
318 code completion tasks obtained from 27 popular GitHub C/C++ repositories covering 15 Common Weakness Enumerations (CWEs). The benchmark, created by ai-sec-lab and last updated in November 2025, is built upon the ARVO dataset and is designed to evaluate large language models and agent frameworks for secure code generation.
CoreCodeBench-Multi is a dataset for benchmarking code generation models, created by meituan-longcat and last updated on May 15, 2025. It contains multi-test cases for evaluating function completion tasks. The dataset includes a standard version and a more difficult version.
Conversations from GitHub issues and pull requests, comprising 30.9 million files totaling 54 GB. Each conversation includes events such as opening an issue, creating a comment, or closing the issue, along with the author username, text, action, and identifiers. The dataset was created by bigcode and last updated in March 2023.
A collection of 950 detection rules sourced from official SIGMA, YARA, and Suricata repositories. Knowledge distillation using the 0dAI-7.5B model was applied to generate questions and enrich responses for each rule. The dataset was created by jcordon5 and last updated on May 18, 2024.
The Stack V2 Dedup is a near-deduplicated collection of source code containing between 1 billion and 10 billion records across 600+ programming languages. Produced by BigCode and last updated in April 2024, it serves as a refined subset of the full Stack v2 dataset for training large language models.
Documents from 2019 related to the local budget of the Lubny Territorial Community in Ukraine. The dataset is hosted on the eu_open_data platform and includes files in .XLSX and .ZIP formats. The dataset metadata was last updated on June 20, 2025.