Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,307 datasets
This Geological Survey of Victoria dataset contains Pre-Permian geological rock units and boundary types, including faults. It was compiled from surface geology maps and interpretation of magnetic, radiometric, gravity, and seismic data to produce a geologically and geophysically reasonable map. It is intended for use with the state magnetic image for additional context on magnetic properties, dyke swarms, and basalt cover.
The Régie du Bâtiment du Québec (RBQ) requires contractors, promoters, and owner-builders to hold a license for construction work. This dataset lists all active RBQ license holders, published by the Government and Municipalities of Québec. The data was last updated on April 17, 2026.
Historical gasoline and aviation fuel tax rates for Ontario, with changes documented from 2017 to 2025. The dataset includes specific rates for unleaded gasoline, leaded gasoline, aviation fuel, and Northern Ontario, provided by the Government of Ontario. It is available in CSV and HTML formats and was last updated on April 17, 2026.
Fattah Golden Superset is a large-scale supervised fine-tuning dataset built by Nomeda Labs for training the Fattah family of coding and agentic coding models. The dataset is described as a labeled superset with no baked-in training ratios, allowing researchers to filter on capability columns to create custom mixtures. The dataset was last updated on June 1, 2026.
3,551 baptisms, marriages, and burials recorded in the earliest surviving church registers in Nova Scotia. Nova Scotia Archives transcribed and translated these Acadian parish records from 1702-1755 for the Acadie 2003-2005 Celebrations. The data provides a tangible link to the last generations of Acadian French living at Annapolis Royal before the Deportation.
The wmt26-mist-sample is a multilingual mix provided by the WMT26 MIST shared task organizers. It contains three types of tasks: context-based QA, open-ended QA, and mono- and cross-lingual summarization. The dataset is intended as a starting point for fine-tuning multilingual large language models.
Alberta Environment and Protected Areas and the Alberta Biodiversity Monitoring Institute developed a Native Cover indicator for Alberta. The dataset tracks aquatic and wetland native cover (AWNC) and terrestrial native cover (TNC) across Hydrological Unit Code 8 watersheds for the years 2010, 2018, 2019, 2020, and 2021. Calculations use ABMI's Wetland and Human Footprint Inventories and Alberta government's DEM-derived riparian data and watershed boundaries.
266 boreholes drilled across Alberta since 1920 are compiled in this interim release. The Alberta Geological Survey began systematically compiling borehole log information into a database in 2010. The dataset comprises three relational tables detailing project sources, borehole summaries, and geological intervals.
Alberta's historical landfill locations, digitized from three sources. The data originates from a 1982 survey by MacLaren Plansearch Lavalin, which ranked sites by potential environmental and human health risk. Subsequent evaluations by Associated Engineering in 1985 and digitization by Alberta Environment and Protected Areas contributed to this spatial dataset.
SmolKalam is a quality-filtered Arabic supervised fine-tuning dataset built as an ensemble translation of HuggingFaceTB/smoltalk2. It covers multi-turn dialogue, reasoning traces, tool and function calling, and long-context examples. The dataset was produced by AdaMLLab and last updated on June 22, 2026.
American Political Science Review Dataverse hosts replication data for a study on political persuasion and belief relevance. The research involved experiments with two large online convenience samples, using large language models to generate counterarguments targeting specific beliefs. Yamil Velez authored the dataset, which was last updated on June 18, 2026.
45,394 triplets of Korean financial text for fine-tuning sentence-embedding models, with graded relevance labels. The dataset was created by BCCard/BCAI using hybrid hard-negative mining that combines FAISS top-K retrieval with a Claude Sonnet LLM judge. It was last updated on June 11, 2026.
27 crystal structures from a structure-binding relationship study for Bruton’s Tyrosine Kinase (BTK) inhibition. The dataset, authored by Rebekah M. West and last updated on May 14, 2026, explores targeting the PH domain with a covalent fragment that modifies a lysine in the PIP3 binding site.
NarraDolma provides a large-scale narrative characterization of the Dolma pretraining corpus. It contains approximately 3 million passages drawn from about 785,000 unique documents across all 12 Dolma sub-corpora, each labeled with a fine-grained narrative feature vector. The dataset was created by teagrjohnson and is intended as a resource for studying how narrative qualities are distributed in web-scale data.
Jiabin Dong published a collection of datasets on figshare in April 2026 for pore-scale numerical studies. The 706.3 KB collection includes files analyzing the synergistic control of grain roundness and volume on the permeability of fractal porous sandstone. It contains datasets for constructing hierarchical Voronoi porous media, comparing theoretical and actual porosity, and relating roundness to permeability via Lattice Boltzmann Method simulations.
A 2006 initiative funded with $58.9 million over five years for Geoscience Australia to acquire pre-competitive geoscience data. The program, delivered in collaboration with States and Territories, aims to attract investment in onshore energy exploration, including geothermal, petroleum, uranium, and thorium. The description outlines the program's structure and the specific Geothermal Energy Project's focus on mapping crustal temperature distribution.
A database supporting academic articles analyzing news coverage of female gubernatorial candidates during Mexico's 2021 election campaigns. The dataset is 630.0 KB in size, stored in an XLSX file, and was created by Edrei Álvarez-Monsiváis. It was last updated on May 15, 2026.
A 5.5 KB Excel file containing statistical test results from a factorial general linear mixed model analysis. The dataset reports F ratios and p-values for fixed and random effects on four plant traits: total height, flower number, leaf width, and pistil length. It was authored by Arezoo Fani and last updated on May 15, 2026.
Statistical test results from a factorial general linear mixed model fitted to four plant traits: total height, flower number, leaf width, and pistil length. The dataset reports F ratios and p-values for fixed effects (parental treatment, offspring treatment, and their interaction) and random effects (maternal plant and block). The author is Arezoo Fani, and the data was last updated on May 15, 2026.
A text dataset for biomedical information extraction, developed for the ACL 2026 Findings paper 'Applicability Condition Extraction for Therapeutic Drug-Disease Relations'. The dataset is authored by B1tta and was last updated on June 18, 2026. It focuses on identifying context-specific conditions under which a drug is therapeutically effective for a disease.