Source code corpora, bug reports, vulnerability databases, network intrusion detection, malware samples
1,561 datasets
A multi-format, process-centric code dataset for training LLM agents. The dataset was empirically validated on 2026-04-10, when fine-tuning a model on version 1.7 with 108,000 training samples for 3 epochs produced a significant performance improvement on the ProcessFlow-Eval benchmark. It was authored by caiovicentino1 and last updated on 2026-04-11.
A synthetic dataset of 6,000 Python programming problems distilled from the Qwen3.5-397B-A17B model. Each example includes a problem statement, a chain-of-thought reasoning trace, and an execution-verified correct solution. The dataset was created by author Madras1 and last updated on 2026-04-07.
The Yarsha Khola watershed in Nepal is covered by a GIS database containing land use vector data for 1961, 1981, 1992, and 1996. The data includes polygon features digitized from topographic maps, Land Resources Mapping Project (LRMP) data, and air photographs, as well as Village Development Committee (VDC) boundaries and a 1-meter resolution orthophoto mosaic. The dataset is hosted on NASA EarthData and originates from the organization CEOS_EXTRA.
An HTML transcript of a parliamentary committee appearance by the Secretary of State for Children and Youth. The appearance occurred on November 18, 2025, before the Standing Committee on Human Resources, Skills and Social Development and the Status of Persons with Disabilities (HUMA). It documents a discussion on the mandate and priorities related to children and youth policy in Canada.
A briefing package prepared for a ministerial appearance before the House Standing Committee on Environment and Sustainable Development. The document outlines the government's position and information for a study on Industrial Carbon Pricing. It was published by Environment and Climate Change Canada and last updated in April 2026.
Every public event on GitHub, including pushes, pull requests, issues, stars, forks, code reviews, releases, and discussions. The data covers activity across over 200 million repositories used by tens of millions of developers. It was created by open-index and last updated in March 2026.
Between 1 million and 10 million records of cybersecurity vulnerability and exploit data, compiled by jason-oneal using the pentestds pipeline as of March 2026. It merges MITRE CVE, NVD CVSS scores, and ExploitDB entries into instruction-tuning formats like Alpaca and ChatML.
A report from an open meeting of the JOIDES planning committee held in Zurich, September 26-28, 1973, concerning the future of the Deep Sea Drilling Project after 1975. It is a legacy product published by the Australian Ocean Data Network on data_gov_au and was last updated on 2026-04-16.
GPM_BASETRMMTMI contains unaltered raw data from the TRMM Microwave Imager (TMI) instrument aboard the TRMM satellite. The product repackages raw binary CCSDS packets into HDF5 format and geolocates the sample data. It is produced by the National Aeronautics and Space Administration and was last updated in March 2026.
NASA GISS Surface Temperature (GISTEMP) analysis measures changing global surface temperature with monthly resolution from 1880 onward. The dataset is produced by the Goddard Institute for Space Studies using adjusted data from the Global Historical Climatology Network, US Historical Climatology Network, and Antarctic station data. It is updated monthly by the GISS team, though the CDIAC presentation is updated annually.
34 cases of comprehensive peace agreements signed between 1989 and 2012, analyzed to understand transitions from intrastate conflict to peace. The dataset, created by İbrahim Kumek, uses fuzzy-set qualitative comparative analysis to examine conditions like power-sharing and international assistance. It was last updated in February 2026.
A 1998 treaty text establishes a framework to promote and protect bilateral investments between Canada and Panama. It is an archived document from Global Affairs Canada, referenced for research or recordkeeping purposes only. The treaty sets out commitments on fair treatment, expropriation safeguards, and dispute settlement mechanisms.
A spectral library for aquatic substrates from the Adelaide Coastal Waters, collected in 2003. The data is hosted in the Australian National Spectral Database and was cited in a technical report for the Adelaide Coastal Waters Study Steering Committee in 2007. The dataset is managed by Geoscience Australia and was last updated in March 2026.
A spectral library for aquatic substrates hosted in the National Spectral Database. The data was cited in a 2007 technical report on remote sensing of marine and coastal features for the Adelaide Coastal Waters Study. Geoscience Australia provides access to this dataset through the Australian National Spectral Database.
Per-hunk code generation dataset derived from openSUSE Build Service maintenance patches. It provides vulnerable code regions and descriptions of upstream CVE fixes for training models to output fixed code versions. The dataset was created by openSUSE and last updated in April 2026.
The Allen Institute for Neural Dynamics (AIND) shares raw and derived data collected from mice, including preliminary data from methods development, as near to the time of collection as possible. Data is shared publicly with metadata under a CC-BY-4.0 license. The specific temporal coverage and scale of the dataset are not provided.
Krishnapadala55's dataset contains cybersecurity training data for educational and defensive purposes. The data likely includes descriptions of vulnerability exploitation techniques, security testing payloads, and attack methodologies. The dataset was last updated on 2026-04-08.
The PhishNChips benchmark consists of 2,000 email stimuli and 220,000 adjudicated model evaluations for assessing how system prompt configurations influence the security behavior of LLM-based email agents. The canonical v5.2 release was authored by AreLit and last updated on April 8, 2026.
A report published by the Australian Ocean Data Network on the Bureau of Mineral Resources (BMR) marine program. The document is a forward-looking plan for marine activities. It is available in PDF and HTML formats and was last updated on April 10, 2026.
This dataset was generated by a collaborative project between the Communications Security Establishment and the Canadian Institute for Cybersecurity, using profiles to systematically create realistic network traffic. It includes seven distinct attack scenarios (Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and Infiltration), simulated across an infrastructure of 50 attacking machines and a victim organization with 420 PCs and 30 servers. The data comprises network traffic and log files, with 80 features extracted using CICFlowMeter-V3.