Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,673 datasets
The Southern Australian Fractured Rock Province dataset from the Australian Ocean Data Network provides descriptive attribute information for areas bounded by spatial groundwater features. It groups descriptive topics into 11 themes, including location, geology, hydrogeology, and land use. The dataset was last updated on 2026-05-04.
A geospatial dataset from the U.S. Environmental Protection Agency's Facility Registry Service (FRS) identifying inactive hazardous waste facilities. The data integrates information from the Resource Conservation and Recovery Act Information System (RCRAInfo), which tracks generators, transporters, and disposers of hazardous waste. This subset contains facilities that were integrated into FRS and are now inactive. It was last updated on April 14, 2026.
Crownelius created this 1,195-row expansion of the original 11-row XXXXL Chain-of-Thought dataset, preserving its 'Narrative Technicality' style. The dataset features a stream-of-consciousness inner monologue format that performs explicit, low-level verification within the prose. It was last updated on Hugging Face on 2026-05-18 02:25:22.
EOMAP Australia Pty Ltd and EOMAP GmbH & Co.KG derived this bathymetry dataset from multispectral WorldView-3 satellite data for the Australian Government in 2022-2023. The data covers the Ashmore Reef and Cartier Island Marine Parks in Western Australia, providing a high-resolution, georeferenced map of shallow water depths. It serves as an essential environmental baseline for long-term monitoring and management of these marine parks.
This hydrogeological inventory covers 30,000 square kilometers of onshore Tasmania for the Late Carboniferous to Late Triassic Tasmania Basin. The dataset, provided by the Australian Ocean Data Network via data.gov.au, contains descriptive attribute information grouped into themes such as geology, hydrogeology, and land use. It was last updated on 2026-05-04.
A synthetic instruction-style question-answering dataset derived from the NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0). It is designed to support training, fine-tuning, retrieval evaluation, and domain-specific question-answering use cases related to AI risk management, trustworthy AI, and responsible AI. The dataset was created by leeroy-jankins and was last updated on Hugging Face in May 2026.
An integrated modeling framework for high-density urban centers links land-use composition to zone-level CO₂ emissions from light-duty passenger vehicles. The dataset, created by Minghui Li and last updated in April 2026, includes three Markov-derived land-use scenarios for 2025 and 2030, processed through a calibrated transport chain. It couples mixed-use development trip generation, a taxi-GPS-calibrated gravity model, and a three-perspective CO₂ attribution scheme.
Yazhou Zhang published underlying numerical data for a study on the wheat transcription factor TaWRKY58 in April 2026. The dataset, hosted on figshare, supports a model where TaWRKY58 acts as a transcriptional repressor coordinating plant architecture and drought response. It is a 22.7 KB XLSX file.
A 2026 protocol dataset from a multi-centre prospective cohort study evaluating the HEart faiLure carer support Programme (HELP). The dataset, created by Gareth Thompson, contains quantitative and qualitative measurement instruments for 180 carers and approximately 180 patients across five sites in the United Kingdom.
ATLAS-WDS is a dataset for training models on wave directional spectra. It contains records with a 47x72 energy matrix flattened into a 3384-dimensional float32 array and corresponding skew-Gaussian anchor parameters. The dataset was created by author wuff-mann and was last updated on Hugging Face in May 2026.
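The flattened layout described for ATLAS-WDS can be illustrated with a minimal sketch. It assumes row-major (C-order) flattening, which the description does not state, and the function name `unflatten_spectrum` and the placeholder values are hypothetical rather than part of the dataset.

```python
import numpy as np

# Matrix shape as stated in the dataset description: 47 x 72 = 3384 values.
ROWS, COLS = 47, 72


def unflatten_spectrum(flat):
    """Reshape a 3384-element vector into a 47x72 float32 energy matrix."""
    arr = np.asarray(flat, dtype=np.float32)
    if arr.shape != (ROWS * COLS,):
        raise ValueError(f"expected {ROWS * COLS} values, got shape {arr.shape}")
    # Assumption: the matrix was flattened in row-major (C) order.
    return arr.reshape(ROWS, COLS)


# Placeholder values for illustration only, not real dataset records.
flat = np.arange(ROWS * COLS, dtype=np.float32)
matrix = unflatten_spectrum(flat)
```

If the dataset actually uses column-major flattening, the reshape call would need `order="F"`; the dataset card should be checked before relying on either layout.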
307 pediatric nurses in Yichang, China were surveyed from December 20–25, 2024. The dataset likely contains general information, emotional labor scores, spiritual climate, and compassion satisfaction scores. Huiqing Liu published the data on figshare under a CC-BY-4.0 license.
Supplementary materials for a 2026 study on calc-alkaline magma differentiation include 21 data tables and 8 figures. The data presents raw and filtered global arc geochemistry, thermodynamic-geochemical modeling results, and bulk-rock and mineral compositions of andesites. It was authored by Jun Wang and shared under a CC-BY-4.0 license.
A 5.5 KB Excel file containing hardware resource consumption data for a simulated 64-user scenario. The dataset supports research on a Deep Unfolding Successive Over-Relaxation (DU-SOR) paradigm for 6G wireless systems, authored by Emmanuel Ampoma Affum and last updated on April 24, 2026. It was shared on figshare under a CC-BY-4.0 license.
Emmanuel Ampoma Affum published a 5.5 KB dataset on figshare in April 2026. The data likely contains research gaps and research question alignments related to a proposed Deep Unfolding Successive Over-Relaxation (DU-SOR) paradigm for 6G wireless systems. This paradigm aims to reduce pilot overhead and computational complexity in Multi-User MIMO systems by using a sparse Graph Transformer instead of explicit Channel State Information.
Survey data from 181 software professionals used in a study investigating the influence of individual cultural values on the adoption of fairness toolkits. The data was collected and analyzed by Stefano Lambiase using Partial Least Squares Structural Equation Modeling (PLS-SEM). The dataset was last updated on April 11, 2026.
Information on the conditions of skating rinks and arenas in Montreal, sourced from a municipal website where the majority of boroughs participate. The data is provided by the Government and Municipalities of Québec and was last updated on April 17, 2026. The HTML resource is described as temporary.
Quarterly environmental radiation monitoring results reported in millirem, as defined by Connecticut state regulations. The dataset, provided by the State of Connecticut's Department of Energy and Environmental Protection, contains records from 2008 onward, with earlier data available in hardcopy. Results below the minimum measurable quantity are recorded as 'M', and the data includes a disclaimer regarding potential errors from entry, migration, or equipment failure.
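The 'M' convention above means result values cannot be read as plain numbers. A minimal sketch of one way to handle it, assuming each result arrives as a string; the function name `parse_reading` and the sample values are hypothetical and not taken from the dataset:

```python
from typing import Optional


def parse_reading(raw: str) -> Optional[float]:
    """Convert one result field to millirem, or None when it is marked 'M'."""
    value = raw.strip()
    # 'M' marks a result below the minimum measurable quantity.
    if value == "M":
        return None
    return float(value)


# Hypothetical sample values for illustration only.
readings = [parse_reading(v) for v in ["12.5", "M", " 8 "]]
measured = [r for r in readings if r is not None]
```

Mapping 'M' to `None` rather than zero keeps below-threshold results distinguishable from a true zero reading.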
A dataset characterizing energy users in Colombia, likely containing 1732 records based on a form identifier. It originates from the Colombian Superintendency of Public Services (SSPD) via the datos.gov.co platform. The data was last updated on the platform in May 2026.
A synthetic dataset generated on 2026-05-31 by author kskkoba using the Chain-of-Thought Self-Instruct methodology. It contains reasoning data produced by the unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF model, served locally via llama.cpp. The source data is derived from the gretelai/gretel-math-gsm8k-v1 dataset.
This dataset annotates 11 categories of confidential named entities for extraction from Japanese text. It is designed for Supervised Fine-Tuning (SFT) of LFM2-family models, for example with LoRA, and was created by author akiFQC. It was last updated on 2026-06-06.