Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,435 datasets
A 31.1 KB answer key for training and testing AI models, published on figshare under a CC-BY-4.0 license. It pairs natural language questions about developing economies with corresponding SPARQL queries for extracting financial risk data from a knowledge graph. The dataset was authored by Adishesh Gonibeed Ravishankar and last updated on 2026-05-30.
A 5.5 KB Excel file containing a frequency normalization table per 100,000 words. The data was created by Aman Matebie Dagnaw and last updated in April 2026. It supports a study comparing an Ethiopian Students Corpus to the British National Corpus to analyze learner errors in past simple habitual constructions.
A dataset from a study comparing corpus-based and conventional grammar instruction for teaching past simple habitual expressions. The 14.2 KB XLSX file contains quantitative and qualitative analysis of written tests from Ethiopian students, with comparisons to the British National Corpus. Author Aman Matebie Dagnaw published the dataset on figshare in April 2026.
Sanjeev B. Khanagar published a research paper on figshare in 2026. The document presents a comparative cross-sectional feasibility study evaluating the accuracy, quality, reliability, and readability of responses from three large language models to 15 common pediatric dentistry questions. The study includes expert evaluations using standardized tools and statistical analysis of the results.
Raw clotting time data used to generate dose-response curves for figures in a research paper. The dataset includes individual measurements plotted against Bothrops jararaca venom concentration across multiple experimental conditions. It was authored by Adrielly Viveiros Torres and last updated on 2026-05-26.
A 9.5 KB Excel file lists major research collaborations and institutional networks contributing to rickettsial disease research in Southeast Asia. The dataset was authored by Stuart D. Blacksell and last updated on May 26, 2026. It highlights key groups and partnerships but is not intended to be an exhaustive list.
Pyrolysis and bulk kinetic studies investigate the hydrocarbon generation potential of marine organic-rich rocks from the Middle Ordovician Goldwyer Formation in Western Australia. The dataset includes Rock Eval pyrolysis results and kinetic parameters for immature to mid-mature calcareous mudstones, distinguishing between oil-prone Type I and mixed oil/gas-prone Type II/III kerogen. This research, published in the International Journal of Coal Geology in 2020, provides basin-specific kinetic inputs for burial history modeling on the Broome Platform.
From January 2017 to March 2026, this dataset records corrective measures imposed for behaviors contrary to coexistence as stipulated in Colombia's Law 1801 of 2016. It is published by the Colombian open data portal, www.datos.gov.co, and includes details on infractions, demographics, and spatiotemporal occurrence. The data is structured with over 25 columns covering the legal framework, offender profiles, and precise incident timing.
The dataset tracks the number of people affiliated with health insurance in the Municipality of Envigado, Colombia. It is disaggregated by health promoting entities (EPS), insurance regime (subsidized and contributory), and includes the non-affiliated poor population (PPNA). The data covers the years 2019, 2020, and 2021 and is hosted on the Colombian open data portal www.datos.gov.co.
Nemotron-SFT-SWE-v3 is a software engineering instruction tuning dataset designed to advance the capabilities of LLMs on SWE-Bench style tasks. It includes agentic trajectories collected using a variety of agent harnesses, including the OpenHands, SWE-agent, and mini-SWE-agent frameworks. The dataset was created by NVIDIA Corporation on 2026-06-04 and is ready for commercial use.
Seabed morphology and geomorphology maps for a subset of Zeehan Marine Park, derived from a 2-meter resolution bathymetry DEM. The data product was created by Geoscience Australia using semi-automated GIS mapping tools applied to multibeam survey data. It classifies seabed features using a nationally consistent classification scheme, with interpretations informed by backscatter intensity and seabed imagery.
MathArena's Brokenarxiv dataset contains training data generated from past ArXiv articles, together with outputs generated by the Qwen3.6-35B language model. The dataset includes model answers to questions about the original statements in the articles. The dataset page was last updated on 2026-06-16.
Beneficiarios Mi Negocio is a dataset from the Colombian open data portal datos.gov.co describing recipients of a government program that develops productive projects and generates income through business capitalization. It contains 24 columns tracking beneficiary demographics, benefit types, amounts, and administrative details. The dataset was last updated on 2026-05-18.
Training data generated from past ArXiv articles includes outputs from the Qwen3.6-35B model. The dataset contains the model's answers on whether perturbed mathematical statements are correct, with the expected answer always being disprove. It was created by MathArena and last updated on June 16, 2026.
A paleontological study investigating bryozoan faunas from the Ordovician, Silurian, and Devonian periods in Australia. The work was compiled by Geoscience Australia Data and last updated on 2026-04-20. It focuses on specific fossil-rich horizons in central-western New South Wales and the Fitzroy Basin.
A 2019 spatial dataset of private water supplies in Northern Ireland, required to be held by the Drinking Water Inspectorate. It consists of 100m by 100m polygons randomly placed around registered supplies to public, commercial, or multi-dwelling premises. The dataset was created by the Government Digital Service on 31 December 2019 and was superseded in April 2020.
A spatial dataset of 100m by 100m squares randomly placed around registered private water supplies in Northern Ireland. The register includes supplies to public or commercial premises or two or more private dwellings, as required by the Private Water Supplies Regulations (Northern Ireland) 2017. This dataset was created by the Drinking Water Inspectorate on 31st December 2019 and superseded on 24th April 2020.
The Drinking Water Inspectorate holds a register of private water supplies in Northern Ireland under the Private Water Supplies Regulations (Northern Ireland) 2017. This spatial dataset represents registered supplies as 100m by 100m squares, including both current and historically monitored supplies. The dataset was created on 29 June 2021 and superseded on 27 September 2021.
The Drinking Water Inspectorate maintains a register of private water supplies for human consumption in Northern Ireland, as required by the Private Water Supplies Regulations (Northern Ireland) 2017. This spatial dataset represents registered supplies as 100m by 100m square polygons, created by the Government Digital Service on December 31, 2019. It includes supplies to public, commercial, or multiple private dwellings that are or were historically monitored.
Anonymized data from a study published in npj Complexity, used to analyze collective sleep and activity patterns among college students. The dataset is 28.2 MB in size and was last updated on April 30, 2026. Mikaela Irene Fudolig is the author, and the data is shared under a CC-BY-4.0 license.