Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
43,998 datasets
A 25.1 MB collection by Tugba Y. Ozmen, last updated in April 2026, investigates assays for homologous recombination deficiency and replication stress in cancer. The work includes a comparative pan-cancer analysis of therapy efficacy and toxicity based on results from clinicaltrials.gov. It explores the integration of these pathways with immune contexture to inform next-generation treatment strategies.
Archived records of civil security events in Quebec, systematically grouped by the Ministry of Public Security. The database documents the consequences and evolution of each event and categorizes events by impact level and emergency response required, based on the Canadian Common Alert Protocol profile. Data compilation includes reports from the Government Operations Center and regional directorates since 1996.
121,422 expert-level instruction-response pairs for offensive cybersecurity tasks. The dataset was created by author 'oyildirim' and is described as the largest open-source offensive cybersecurity SFT dataset. It was last updated on June 17, 2026.
Colombian data on graduates from the Colegio Mayor del Cauca university institution, starting from the 2011-I semester. The dataset tracks the number of graduates by program, period, academic level, and gender. It is published via the Colombian open data portal.
An unpublished database analyzing front pages and opinion columns from major-circulation newspapers for gender stereotypes targeting presidential candidates Sheinbaum and Gálvez. The 189.1 KB XLSX file served as the basis for academic publications and conference presentations. It was last updated on 2026-05-15.
A meta-epidemiological study protocol analyzes publication delays in systematic reviews. The dataset likely contains records of interventional, RCT-based meta-analyses published in top-tier general medical journals and the Cochrane Database of Systematic Reviews between 2023 and 2025. Jia Song authored this protocol, which was uploaded to figshare in April 2026.
A cleaned Wikipedia corpus combines Serbian and Croatian Wikipedia articles. Croatian text has been transliterated to Cyrillic script, and wiki markup, infoboxes, and stub articles have been removed. The corpus was compiled by RafaelUI and is available on Hugging Face.
A geospatial analysis compares the levelized cost of heat and carbon removal for three decarbonized thermal energy sources across the United States. The study uses detailed process models for sedimentary basin geothermal, concentrated solar, and heat pump technologies, with sorbent-based direct air capture as a case study. The dataset was authored by Caleb H. Geissler and last updated on April 28, 2026.
Experimental data for a series of novel podophyllotoxin derivatives designed to target LAT1 transporters for esophageal cancer treatment. The dataset includes results for the lead compound B11, showing a 64.6% tumor growth inhibition in mice and a more than 4-fold improvement in tolerability compared to etoposide. The data was authored by Manwei Jia and last updated on 2026-04-14.
Wistar rats (Rattus norvegicus) were the source for isolated cardiac mitochondria used to study the direct effects of dapagliflozin. The dataset, authored by Itanna Isis Araújo de Souza and last updated in May 2026, contains findings on oxygen consumption, ATP production, ROS generation, and membrane potential. It is shared under a CC-BY-4.0 license as a 1.9 MB DOCX file.
The Lord Howe Island shelf in NSW was surveyed by Geoscience Australia in 2008. The survey mapped seabed bathymetry and characterized benthic environments using sediment sampling, rock coring, underwater video, and current measurements. The lh_back_8m grid is a processed backscatter product covering 1034 sq km, derived from EM300 data.
Australian marine physical environmental data includes metadata for 37 variables collated by the Marine Biodiversity Hub. Bathymetry, geomorphology, seabed sediment, and seabed exposure data were produced by Geoscience Australia, while bottom-water and surface-water parameters were produced by CSIRO. All data were transformed to a common datum (WGS84) and gridded at a 0.01-degree cell size.
Geomorphological features of the Great Artesian Basin, including offshore extents beneath the Gulf of Carpentaria. The dataset classifies features into five categories based on depositional environment: Marine, Fluvial, Aeolian, Playa-lacustrine, and Erosional terrain. It was produced by Geoscience Australia and is available via the Australian Ocean Data Network.
The Australian Ocean Data Network hosts a collection of abstracts for academic papers on sulphide ore formation in sedimentary rocks. The abstracts cover topics including models of ore formation, metal sources, lead isotopic systematics, and diagenetic mineralization, with specific references to deposits like Mount Isa and Coxco. The dataset was last updated on 2026-05-05.
VisReason is a large-scale dataset designed to advance visual Chain-of-Thought reasoning in multimodal large language models. It supervises a human-like, global-to-local reasoning process where models first form a holistic hypothesis about a scene before iteratively zooming into salient regions. The dataset was created by Y-Research-Group and was last updated on June 21, 2026.
A 5.5 KB Excel file containing a list of symbols related to smart charging algorithms for electric vehicles. The dataset was created by Felix Wieberneit and last updated on April 22, 2026. It supports research demonstrating a potential 37% annual reduction in carbon intensity from controlled EV charging.
Sarah Hornfeck's dataset, last updated April 22, 2026, presents sgRNAs used for generating stable KIs. The 5.5 KB Excel file contains data from a study highlighting the importance of analyzing proteins at endogenous levels, showing that colocalization of Rab11 and LAMP1 varied drastically between endogenous and ectopic expression conditions.
An initial value algorithm examines the time-dependent evolution of electromagnetic fields from oblique scattering of bounded pulses from an infinite planar dielectric interface. The qubit lattice algorithm (QLA) is utilized, which is almost fully unitary, leading to excellent conservation of electromagnetic energy. The dataset was created by Min Soe, George Vahala, Linda Vahala, Efstratios Koukoutsis, Abhay K. Ram, and Kyriakos Hizanidis and was last updated on June 23, 2026.
A free preview subset of a larger proprietary dataset developed by Egomnia S.p.A. The data consists of a raw Italian text corpus derived from content sourced from the italia.progettotalia.it website. The full dataset is not included in this repository and can be purchased separately.
Harris Greenstone Domain GIS data delineates a late Archean-Proterozoic tectonostratigraphic terrane within South Australia's Gawler Craton. The dataset characterizes the Archean Harris Greenstone Belt, including komatiite, basalt, and banded iron formation, metamorphosed during the ~2440 Ma Sleafordian Orogeny. Its interpretation is based on aeromagnetic and gravity surveys, supplemented by diamond drillcore, to map structures beneath thin Quaternary and Eocene cover.