Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
43,995 datasets
Alberta's historical landfill locations, digitized from three sources. The data originates from a 1982 survey by MacLaren Plansearch Lavalin, which ranked sites by potential environmental and human health risk. Subsequent evaluations by Associated Engineering in 1985 and digitization by Alberta Environment and Protected Areas contributed to this spatial dataset.
SmolKalam is a quality-filtered Arabic supervised fine-tuning dataset built as an ensemble translation of HuggingFaceTB/smoltalk2. It covers multi-turn dialogue, reasoning traces, tool and function calling, and long-context examples. The dataset was produced by AdaMLLab and last updated on June 22, 2026.
American Political Science Review Dataverse hosts replication data for a study on political persuasion and belief relevance. The research involved experiments with two large online convenience samples, using large language models to generate counterarguments targeting specific beliefs. Yamil Velez authored the dataset, which was last updated on June 18, 2026.
45,394 triplets of Korean financial text for fine-tuning sentence-embedding models, with graded relevance labels. The dataset was created by BCCard/BCAI using hybrid hard-negative mining that combines FAISS top-K search with a Claude Sonnet LLM judge. It was last updated on June 11, 2026.
27 crystal structures from a structure-binding relationship study for Bruton’s Tyrosine Kinase (BTK) inhibition. The dataset, authored by Rebekah M. West and last updated on 2026-05-14, explores targeting the PH domain with a covalent fragment that modifies a lysine in the PIP3 binding site.
NarraDolma provides a large-scale narrative characterization of the Dolma pretraining corpus. It contains approximately 3 million passages drawn from about 785,000 unique documents across all 12 Dolma sub-corpora, each labeled with a fine-grained narrative feature vector. The dataset was created by teagrjohnson and is intended as a resource for studying how narrative qualities are distributed in web-scale data.
Jiabin Dong published a collection of datasets on figshare in April 2026 for pore-scale numerical studies. The 706.3 KB collection includes files analyzing the synergistic control of grain roundness and volume on the permeability of fractal porous sandstone. It contains datasets for constructing hierarchical Voronoi porous media, comparing theoretical and actual porosity, and relating roundness to permeability via Lattice Boltzmann Method simulations.
A 2006 initiative funded with $58.9 million over five years for Geoscience Australia to acquire pre-competitive geoscience data. The program, delivered in collaboration with States and Territories, aims to attract investment in onshore energy exploration, including geothermal, petroleum, uranium, and thorium. The description outlines the program's structure and the specific Geothermal Energy Project's focus on mapping crustal temperature distribution.
A database supporting academic articles analyzing news coverage of female gubernatorial candidates during Mexico's 2021 election campaigns. The dataset is 630.0 KB in size, stored in an XLSX file, and was created by Edrei Álvarez-Monsiváis. It was last updated on 2026-05-15.
A 5.5 KB Excel file containing statistical test results from a factorial general linear mixed model analysis. The dataset reports F ratios and p-values for fixed and random effects on four plant traits: total height, flower number, leaf width, and pistil length. It was authored by Arezoo Fani and last updated on 2026-05-15.
Statistical test results from a factorial general linear mixed model fitted to four plant traits: total height, flower number, leaf width, and pistil length. The dataset reports F ratios and p-values for fixed effects (parental treatment, offspring treatment, and their interaction) and random effects (maternal plant and block). The author is Arezoo Fani, and the data was last updated on May 15, 2026.
A text dataset for biomedical information extraction, developed for the ACL 2026 Findings paper 'Applicability Condition Extraction for Therapeutic Drug-Disease Relations'. The dataset is authored by B1tta and was last updated on June 18, 2026. It focuses on identifying context-specific conditions under which a drug is therapeutically effective for a disease.
Manually defined parameters serve as the ground-truth reference for generating synthetic cell-like clusters. The 5.5 KB XLS file contains a priori values controlling cluster shape, spread, orientation, and event number. Authored by Bradley Mason and last updated in May 2026, this dataset supports replication and accuracy assessment for the Rosetta-Routine modelling pipeline.
A 5.5 KB Excel file maps traditional descriptive statistical measures to conversion methods used by the Rosetta-Routine modeling algorithm. The mapping is intended to acquire information from unknown data and define corresponding cluster generator argument variables. Author Bradley Mason last updated the file on May 29, 2026, and it is shared under a CC-BY-4.0 license.
A 5.1 MB Excel file containing datasets used for figure generation and quantitative analyses in a manuscript. The data includes real and synthetic event-level measurements intended for population modelling. It was authored by Bradley Mason and last updated on 2026-05-29.
A 5.5 KB Excel file uploaded to figshare by Wyatt H. Bridgman on May 29, 2026. It contains data on the predictive skill of Probabilistic Predictive Trajectories (PPTs) generated using different infection-rate estimation procedures. The PPTs are scored using the Continuous Ranked Probability Score (CRPS), and the scores are expressed in units of case counts.
Around 6,000 regulated waste management facilities in the UK report annual data on waste quantities and types received and sent on from site. This data, collected since 2006 by the Environment Agency, is used for compliance monitoring and has historically supported planning by the EC, DEFRA, and local authorities. It is published in multiple formats including an MS Access interrogator, Excel extracts, and regional summary tables.
The Waste Data Interrogator 2017 dataset contains annual waste quantity and type data reported by regulated waste management facilities in the UK. It includes data from around 6,000 sites, collected by the Environment Agency for compliance monitoring and planning. The data is provided in multiple formats including an MS Access interrogator and Excel extracts.
4.3 GB of cleaned adsorption structures from CatHub data, used for training the DBCata model. The dataset includes model checkpoints, fine-tuning scripts, and results for out-of-distribution testing. It was authored by Songze Huo and last updated on May 25, 2026.
ENERGY STAR Certified Residential Refrigerators meet specific program requirements effective from September 15, 2014 or August 5, 2021. The dataset, sourced from data.energystar.gov, includes model specifications and efficiency metrics such as Annual Energy Use and Percent Less Energy Use than US Federal Standard. It was last updated on April 3, 2026.