Source code corpora, bug reports, vulnerability databases, network intrusion detection, malware samples
1,586 datasets
Malware Spllited info is a dataset hosted on Kaggle. Its specific content and structure are not detailed in the available metadata. The dataset likely contains information related to malware, possibly split across different categories or features.
String sequences extracted from malicious Windows executables form the core of this dataset. The dataset is hosted on Kaggle, but its author, organization, and creation date are not specified. Details on the number of samples, specific malware families, and extraction methodology are also unavailable.
A monthly-updated list of all payments over £25,000 made by the Department for Infrastructure during the 2025/26 financial year. The data is published by OpenDataNI as part of the Northern Ireland Civil Service commitment to expenditure transparency. It is available in CSV format under the OGL-UK-3.0 license.
A historical text covering events from July 1954 to July 1965, detailing the American involvement in Vietnam. The work is structured chronologically, with chapters focusing on specific periods and events like the Battle of Ap Bac and political developments. The original source is a book titled 'Vietnam: An American Ordeal'.
Healthy People 2010 is a nationwide health promotion and disease prevention agenda for the United States. The initiative was designed as a 10-year roadmap to improve health for all people during the first decade of the 21st century. Its overarching purpose is to promote health and prevent illness, disability, and premature death.
A dataset for intrusion detection, likely containing network traffic or system logs. It is published on Kaggle, but specific details about its size, creation date, and author are not provided. The title suggests a focus on explainable artificial intelligence (XAI) for Security Operations Center (SOC) applications.
Cybersecurity Attacks Defense Dataset 2026 is a dataset published on Kaggle. Its title suggests it contains records related to cyber attacks and defensive measures. The specific content, scale, and collection methodology are unknown from the provided metadata.
Agent-native trajectories used in the daVinci-Dev project for mid-training in software engineering. The dataset includes trajectories constructed from GitHub pull requests, specifically a Python variant. The dataset was created by GAIR and was last updated on the platform in January 2026.
HoudiniVexBench is a benchmark dataset for VEX (Vector Expression Language) code generation and understanding tasks. It contains 86 tasks extracted from Houdini 21.0.596, split into code completion, documentation-to-code, and code explanation categories. The dataset, created by kelvincai and last updated in February 2026, is hosted on HuggingFace.
This dataset contains satellite imagery for land use and remote sensing applications. It supports tasks like land cover classification and environmental monitoring.
A cybersecurity dataset containing malicious IP addresses and abuse confidence scores. The dataset was sourced from Kaggle, but the author, organization, and last update date are unknown. The specific number of rows, file formats, and license details are also not provided.
Loghub provides raw system logs from three major domains: HDFS, BGL, and OpenStack. The collection is intended for research into parser-free anomaly detection methods. The dataset's author, organization, and specific size are not detailed in the provided metadata.
The International Cotton Advisory Committee's fifteenth plenary meeting in 1956 included representatives from 62 governments. The report details discussions on the world cotton supply-demand imbalance, attributing it to high prices, improved production techniques, and new production areas in underdeveloped countries. It contains policy recommendations for price stability and flexibility.
A repository of over 337,000 Common Vulnerabilities and Exposures (CVE) records sourced from the National Vulnerability Database (NVD). The dataset includes CVSS scores and other NVD metrics, covering vulnerabilities reported from 1988 through 2026. It is hosted on Kaggle and described as a clean collection of this security data.
86 benchmark tasks for VEX (Vector Expression Language) code generation and understanding, extracted from Houdini 21.0.596. The dataset, created by kelvincai, was last updated on February 11, 2026. It includes tasks for code completion, documentation-to-code, and code explanation.
TEMPEST-OSINT provides a collection of cybersecurity documents and paired question-answer sets designed for Retrieval-Augmented Generation (RAG) evaluation. Created by Manoel Malon Costa de Moura and hosted on Harvard Dataverse, the data supports the RRAG approach for searching technical security documentation as of March 2026.
Cybersecurity document corpus and question-answer pairs designed for evaluating Retrieval-Augmented Generation (RAG) systems. Developed by Manoel Malon Costa de Moura and hosted on Harvard Dataverse, the collection supports the RRAG search methodology as of March 2026. It contains two distinct subsets: an ingestion dataset for document indexing and an evaluation dataset for performance testing.
A text dataset focused on mapping cybersecurity intents to corresponding commands. It was published on Kaggle, but the author, organization, and creation date are not specified. The dataset's volume, specific content, and structure are unknown from the provided metadata.
daVinci-Agency is a high-quality dataset for training agents on long-horizon software engineering tasks. The dataset was created by GAIR and last updated on February 4, 2026. It provides trajectories mined from real-world Pull Request chains to model the software evolution process.
District of Columbia data on traffic safety, sustainability, and urban planning. The dataset is updated through March 2026 and is licensed for open use under CC-BY. Row and column counts are unspecified.