DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


NLP & Text Datasets | DataSalon


NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,435 datasets


GEMR KG: Ground Truth Questions for AI Generated SPARQL Query Validation

A 31.1 KB answer key for training and testing AI models, published on figshare under a CC-BY-4.0 license. It pairs natural language questions about developing economies with corresponding SPARQL queries for extracting financial risk data from a knowledge graph. The dataset was authored by Adishesh Gonibeed Ravishankar and last updated on 2026-05-30.

Text, Graph, Financial Risk, Question Answering, Ai Training, Sparql, Finance, Natural Language Processing, Synthetic, +1


Frequency Normalization Table for Past Simple Habitual Expressions

A 5.5 KB Excel file containing a table of frequencies normalized per 100,000 words. The data was created by Aman Matebie Dagnaw and last updated in April 2026. It supports a study comparing an Ethiopian Students Corpus to the British National Corpus to analyze learner errors in past simple habitual constructions.

Tabular, Excel, Learner Corpus, Grammar Accuracy, Natural Language Processing, Corpus Based Instruction, English Language Learning, +1


Frequency Distribution of Past Simple Tense for Ethiopian Learners

A dataset from a study comparing corpus-based and conventional grammar instruction for teaching past simple habitual expressions. The 14.2 KB XLSX file contains quantitative and qualitative analysis of written tests from Ethiopian students, with comparisons to the British National Corpus. Author Aman Matebie Dagnaw published the dataset on figshare in April 2026.

Tabular, Excel, Learner Corpus, Past Tense, Corpus Linguistics, Natural Language Processing, English Grammar, Education Research, +1


Comparative Study of LLM Responses to 15 Pediatric Dentistry Queries

Sanjeev B. Khanagar published a research paper on figshare in 2026. The document presents a comparative cross-sectional feasibility study evaluating the accuracy, quality, reliability, and readability of responses from three large language models to 15 common pediatric dentistry questions. The study includes expert evaluations using standardized tools and statistical analysis of the results.

Text, Health informatics, Benchmark, Healthcare, Clinical Evaluation, Pediatric Dentistry, Large Language Models, Synthetic, +1


Clotting Time Measurements for Bothrops Jararaca Venom Dose-Response Curves

Raw clotting time data used to generate dose–response curves for figures in a research paper. The dataset includes individual measurements plotted against Bothrops jararaca venom concentration across multiple experimental conditions. It was authored by Adrielly Viveiros Torres and last updated on 2026-05-26.

Tabular, Excel, Clotting Time, Biochemistry, Dose Response, Venom Concentration, +1


Rickettsial Disease Research Collaborations and Networks in Southeast Asia

A 9.5 KB Excel file lists major research collaborations and institutional networks contributing to rickettsial disease research in Southeast Asia. The dataset was authored by Stuart D. Blacksell and last updated on May 26, 2026. It highlights key groups and partnerships but is not intended to be an exhaustive list.

Tabular, Excel, Southeast Asia, Rickettsial Disease, Healthcare, Research Collaborations, Institutional Networks, +1


Hydrocarbon Generation Kinetics for the Ordovician Goldwyer Formation, Canning Basin

Pyrolysis and bulk kinetic studies investigate the hydrocarbon generation potential of marine organic-rich rocks from the Middle Ordovician Goldwyer Formation in Western Australia. The dataset includes Rock Eval pyrolysis results and kinetic parameters for immature to mid-mature calcareous mudstones, distinguishing between oil-prone Type I and mixed oil/gas-prone Type II/III kerogen. This research, published in the International Journal of Coal Geology in 2020, provides basin-specific kinetic inputs for burial history modeling on the Broome Platform.

Tabular, Petroleum Geology, Source rock, Hydrocarbon Generation, Geochemistry, Kinetic Modeling, +1


Corrective Measures in Bucaramanga Municipality from 2017 to 2026

This dataset records corrective measures imposed in Bucaramanga between January 2017 and March 2026 for behaviors contrary to coexistence, as stipulated in Colombia's Law 1801 of 2016. It is published on the Colombian open data portal, www.datos.gov.co, and includes details on infractions, demographics, and spatiotemporal occurrence. The data has over 25 columns covering the legal framework, offender profiles, and precise incident timing.

Tabular, Time Series, CSV, XML, JSON, Law Enforcement, Colombia, Public Safety, Corrective Measures, Municipal Data, +1


Health Insurance Coverage in Envigado Municipality by Insurer and Regime, 2019-2021

The dataset tracks the number of people affiliated with health insurance in the Municipality of Envigado, Colombia. It is disaggregated by health promoting entities (EPS), insurance regime (subsidized and contributory), and includes the non-affiliated poor population (PPNA). The data covers the years 2019, 2020, and 2021 and is hosted on the Colombian open data portal www.datos.gov.co.

Tabular, CSV, XML, JSON, Health Insurance, Colombia, Municipal Data, Public Health, +1


Nemotron-SFT-SWE-v3: Software Engineering Instruction Tuning Dataset

Nemotron-SFT-SWE-v3 is a software engineering instruction tuning dataset designed to improve LLM performance on SWE-Bench style tasks. It includes agentic trajectories collected using a variety of agent harnesses, including the OpenHands, SWE-agent, and mini-SWE-agent frameworks. The dataset was created by NVIDIA Corporation on 2026-06-04 and is ready for commercial use.

Text, Software Engineering, Code Generation, Instruction Tuning, Agent Trajectories, +1


Seabed Morphology and Geomorphology of Zeehan Marine Park, Australia

Seabed morphology and geomorphology maps for a subset of Zeehan Marine Park, derived from a 2-meter resolution bathymetry DEM. The data product was created by Geoscience Australia using semi-automated GIS mapping tools applied to multibeam survey data. It classifies seabed features using a nationally consistent classification scheme, with interpretations informed by backscatter intensity and seabed imagery.

Geospatial, 🇦🇺 Australia, ZIP, Marine Park, Seabed Morphology, Marine Geomorphology, Bathymetry, +1


Brokenarxiv: Qwen3.6-35B Outputs on ArXiv-Derived Training Data

MathArena's Brokenarxiv dataset contains training data generated from past ArXiv articles, together with outputs generated by the Qwen3.6-35B language model. The dataset includes model answers to questions about the original statements in the articles. The dataset page was last updated on 2026-06-16.

Text, Mathematical Text, Training Data, Arxiv, Synthetic, Language Model Outputs, +1


Beneficiarios Mi Negocio: Colombian Business Grant Recipients with Demographic Details

Beneficiarios Mi Negocio is a dataset from the Colombian open data portal datos.gov.co describing recipients of a government program that develops productive projects and generates income through business capitalization. It contains 24 columns tracking beneficiary demographics, benefit types, amounts, and administrative details. The dataset was last updated on 2026-05-18.

Tabular, CSV, XML, JSON, Social Welfare, Colombia, Demographics, Business Grants, Government Program, +1


Brokenarxiv Training Outputs Disprove: Qwen3.6-35B Responses to Perturbed ArXiv Statements

This dataset contains training data generated from past ArXiv articles, along with outputs from the Qwen3.6-35B model. The outputs are the model's answers on whether perturbed mathematical statements are correct; the expected answer is always "disprove". It was created by MathArena and last updated on June 16, 2026.

Text, Statement Verification, Mathematical Text, Model Evaluation, Llm Training, Arxiv Derived, Synthetic, +1


Australian Ordovician, Silurian, and Devonian Bryozoa Fossil Records

A paleontological study of bryozoan faunas from the Ordovician, Silurian, and Devonian periods in Australia. The work was compiled by Geoscience Australia Data and last updated on 2026-04-20. It focuses on specific fossil-rich horizons in central-western New South Wales and the Fitzroy Basin.

Text, 🇦🇺 Australia, Geology, Bryozoa, Palaeozoic, Fossil, +1


Register of Private Water Supplies in Northern Ireland as of December 2019

A 2019 spatial dataset of private water supplies in Northern Ireland, required to be held by the Drinking Water Inspectorate. It consists of 100m by 100m polygons randomly placed around registered supplies to public, commercial, or multi-dwelling premises. The dataset was created by the Government Digital Service on 31 December 2019 and was superseded in April 2020.

Geospatial, CSV, JSON, Government Registry, Northern Ireland, Water Supply, Public Health, +1


Register of Private Water Supplies in Northern Ireland as of December 2019

A spatial dataset of 100m by 100m squares randomly placed around registered private water supplies in Northern Ireland. The register includes supplies to public or commercial premises or two or more private dwellings, as required by the Private Water Supplies Regulations (Northern Ireland) 2017. This dataset was created by the Drinking Water Inspectorate on 31st December 2019 and superseded on 24th April 2020.

Geospatial, CSV, JSON, Environmental Health, Northern Ireland, Water Supply, Government Register, +1


Register of Private Water Supplies in Northern Ireland, 2021

The Drinking Water Inspectorate holds a register of private water supplies in Northern Ireland under the Private Water Supplies Regulations (Northern Ireland) 2017. This spatial dataset represents registered supplies as 100m by 100m squares, including both current and historically monitored supplies. The dataset was created on 29 June 2021 and superseded on 27 September 2021.

Geospatial, CSV, JSON, Government Registry, Northern Ireland, Water Supply, Public Health, +1


Register of Private Water Supplies in Northern Ireland as of December 2019

The Drinking Water Inspectorate maintains a register of private water supplies for human consumption in Northern Ireland, as required by the Private Water Supplies Regulations (Northern Ireland) 2017. This spatial dataset represents registered supplies as 100m by 100m square polygons, created by the Government Digital Service on December 31, 2019. It includes supplies to public, commercial, or multiple private dwellings that are or were historically monitored.

Geospatial, CSV, JSON, Government Registry, Northern Ireland, Water Supply, Public Health, +1


Collective Sleep and Activity Patterns of College Students from Wearable Devices

Anonymized data from a study published in npj Complexity, used to analyze collective sleep and activity patterns among college students. The dataset is 28.2 MB in size and was last updated on April 30, 2026. Mikaela Irene Fudolig is the author, and the data is shared under a CC-BY-4.0 license.

Tabular, Time Series, CSV, College Students, Activity Tracking, Wearable Devices, Synthetic, +1
