DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,307 datasets

Geological polygons based on geological interpretation of geophysical data (1:1,000,000)

This Geological Survey of Victoria dataset contains Pre-Permian geological rock units and boundary types, including faults. It was compiled from surface geology maps and from interpretation of magnetic, radiometric, gravity, and seismic data to produce a geologically and geophysically reasonable map. It is intended for use with the state magnetic image, which gives additional context on magnetic properties, dyke swarms, and basalt cover.

Geospatial, Geology, Geospatial Polygons, Geophysics, Computer Vision, Victoria Australia, Rock Units, +1

Active Construction Contractor Licenses in Quebec

The Régie du Bâtiment du Québec (RBQ) requires contractors, promoters, and owner-builders to hold a license for construction work. This dataset lists all active RBQ license holders, published by the Government and Municipalities of Québec. The data was last updated on April 17, 2026.

Tabular, CSV, JSON, Government Data, Construction Regulation, Quebec, Business Licenses, +1

Ontario Fuel Tax Rates with Historical Price Points

Historical gasoline and aviation fuel tax rates for Ontario, with changes documented from 2017 to 2025. The dataset includes specific rates for unleaded gasoline, leaded gasoline, aviation fuel, and Northern Ontario, provided by the Government of Ontario. It is available in CSV and HTML formats and was last updated on April 17, 2026.

Tabular, 🇨🇦 Canada, CSV, Fuel Tax, Taxation, Government Policy, Energy, +1

Fattah Golden Superset: A Model-Agnostic SFT Corpus for Coding Models

Fattah Golden Superset is a large-scale supervised fine-tuning dataset built by Nomeda Labs for training the Fattah family of coding and agentic coding models. The dataset is described as a labeled superset with no baked-in training ratios, allowing researchers to filter on capability columns to create custom mixtures. The dataset was last updated on June 1, 2026.

Text, Model Training, Code Generation, Large Scale, Natural Language Processing, Agentic Coding, Supervised Finetuning, +1

St. Jean-Baptiste Parish Registers, 1702-1755, 3551 Events

3551 baptisms, marriages, and burials recorded in the earliest surviving church registers in Nova Scotia. Nova Scotia Archives transcribed and translated these Acadian parish records from 1702-1755 for the Acadie 2003-2005 Celebrations. The data provides a tangible link to the last generations of Acadian French living at Annapolis Royal before the Deportation.

Tabular, CSV, XML, Genealogy, Parish Records, Historical Registers, Demographics, Acadian History, +1

WMT26 MIST Sample: Multilingual Tasks for LLM Fine-Tuning

The wmt26-mist-sample is a multilingual mix provided by the WMT26 MIST shared task organizers. It contains three types of tasks: context-based QA, open-ended QA, and mono- and cross-lingual summarization. The dataset is intended as a starting point for fine-tuning multilingual large language models.

Text, Multilingual, Machine Translation, Multilingual QA, LLM Fine Tuning, Text Summarization, +1

Alberta Native Cover Indicator for Watersheds, 2010-2021

Alberta Environment and Protected Areas and the Alberta Biodiversity Monitoring Institute developed a Native Cover indicator for Alberta. The dataset tracks aquatic and wetland native cover (AWNC) and terrestrial native cover (TNC) across Hydrological Unit Code 8 watersheds for the years 2010, 2018, 2019, 2020, and 2021. Calculations use ABMI's Wetland and Human Footprint Inventories and Alberta government's DEM-derived riparian data and watershed boundaries.

Geospatial, ZIP, XML, Environmental monitoring, Wetlands, Human Footprint, Land Use, Alberta, +1

Alberta Geological Survey Borehole Data, Interim Release of 266 Boreholes

266 boreholes drilled across Alberta since 1920 are compiled in this interim release. The Alberta Geological Survey began systematically compiling borehole log information into a database in 2010. The dataset comprises three relational tables detailing project sources, borehole summaries, and geological intervals.

Tabular, Geospatial, 🇨🇦 Canada, XML, Excel, Geology, Borehole Data, Geological Survey, +1

Historical Landfills in Alberta with Environmental Risk Rankings

Alberta's historical landfill locations, digitized from three sources. The data originates from a 1982 survey by MacLaren Plansearch Lavalin, which ranked sites by potential environmental and human health risk. Subsequent evaluations by Associated Engineering in 1985 and digitization by Alberta Environment and Protected Areas contributed to this spatial dataset.

Geospatial, ZIP, XML, Historical Landfills, Benchmark, Healthcare, Land Use, Alberta, +1

SmolKalam: Arabic Conversational Supervised Fine-Tuning Dataset

SmolKalam is a quality-filtered Arabic supervised fine-tuning dataset built as an ensemble translation of HuggingFaceTB/smoltalk2. It covers multi-turn dialogue, reasoning traces, tool and function calling, and long-context examples. The dataset was produced by AdaMLLab and last updated on June 22, 2026.

Text, Reasoning Traces, Conversational AI, Arabic Language, Multiturn Dialogue, Supervised Fine Tuning, +1

Beyond Belief Change: LLM-Generated Counterargument Experiments on Political Attitudes

American Political Science Review Dataverse hosts replication data for a study on political persuasion and belief relevance. The research involved experiments with two large online convenience samples, using large language models to generate counterarguments targeting specific beliefs. Yamil Velez authored the dataset, which was last updated on June 18, 2026.

Tabular, Belief Attitude Change, LLM Experiments, Public Opinion, Political Persuasion, Synthetic, +1

BCAI Finance Kor Embedding Triplet: Korean Financial Text Triplets

45,394 triplets of Korean financial text for fine-tuning sentence-embedding models, with graded relevance labels. The dataset was created by BCCard/BCAI, which combined FAISS top-K and a Claude Sonnet LLM judge for hybrid hard-negative mining. It was last updated on June 11, 2026.

Text, Korean, Machine Learning, Triplet, Embedding, Finance, Natural Language Processing, Text Embeddings, +1

Targeting a Pleckstrin Homology Domain with a Lysine-Reactive Covalent Binder

27 crystal structures from a structure-binding relationship study for Bruton’s Tyrosine Kinase (BTK) inhibition. The dataset, authored by Rebekah M. West and last updated on 2026-05-14, explores targeting the PH domain with a covalent fragment that modifies a lysine in the PIP3 binding site.

Tabular, CSV, PH Domain, Benchmark, Protein Inhibition, Structural Biology, BTK Kinase, Covalent Binding, +1

NarraDolma: Narrative Feature Vectors for the 3-Trillion-Token Dolma Corpus

NarraDolma provides a large-scale narrative characterization of the Dolma pretraining corpus. It contains approximately 3 million passages drawn from about 785,000 unique documents across all 12 Dolma sub-corpora, each labeled with a fine-grained narrative feature vector. The dataset was created by teagrjohnson and is intended as a resource for studying how narrative qualities are distributed in web-scale data.

Text, LLM Pretraining, Dolma, Large Scale, Natural Language Processing, Narrative Analysis, Text Corpus, +1

Porous Media Permeability Models with Grain Roundness and Porosity Data

Jiabin Dong published a collection of datasets on figshare in April 2026 for pore-scale numerical studies. The 706.3 KB collection includes files analyzing the synergistic control of grain roundness and volume on the permeability of fractal porous sandstone. It contains datasets for constructing hierarchical Voronoi porous media, comparing theoretical and actual porosity, and relating roundness to permeability via Lattice Boltzmann Method simulations.

Tabular, Text, Excel, Porous Media, Digital Reconstruction, Permeability, Lattice Boltzmann, Permeability Modeling, Numerical Simulation, Synthetic, Grain Roundness, +1

Onshore Energy Security Program: Geothermal Energy Project Scope and Progress

A 2006 initiative funded with $58.9 million over five years for Geoscience Australia to acquire pre-competitive geoscience data. The program, delivered in collaboration with States and Territories, aims to attract investment in onshore energy exploration, including geothermal, petroleum, uranium, and thorium. The description outlines the program's structure and the specific Geothermal Energy Project's focus on mapping crustal temperature distribution.

Text, 🇦🇺 Australia, Heat Flow, Large Scale, Geothermal energy, Geoscience, +1

Media Coverage Analysis of Female Gubernatorial Candidates in Mexico, 2021

A database supporting academic articles analyzing news coverage of female gubernatorial candidates during Mexico's 2021 election campaigns. The dataset is 630.0 KB in size, stored in an XLSX file, and was created by Edrei Álvarez-Monsiváis. It was last updated on 2026-05-15.

Tabular, Excel, Media Coverage, Mexico Elections, News Analysis, Gender Politics, Political Communication, +1

Factorial Linear Mixed Model Test Results for Plant Traits

A 5.5 KB Excel file containing statistical test results from a factorial general linear mixed model analysis. The dataset reports F ratios and p-values for fixed and random effects on four plant traits: total height, flower number, leaf width, and pistil length. It was authored by Arezoo Fani and last updated on 2026-05-15.

Tabular, Excel, Plant Biology, Linear Mixed Model, +1

Plant Trait Statistical Test Results from a Factorial Linear Mixed Model

Statistical test results from a factorial general linear mixed model fitted to four plant traits: total height, flower number, leaf width, and pistil length. The dataset reports F ratios and p-values for fixed effects (parental treatment, offspring treatment, and their interaction) and random effects (maternal plant and block). The author is Arezoo Fani, and the data was last updated on May 15, 2026.

Tabular, Excel, Plant Biology, Linear Mixed Model, +1

Drug-ACE: Applicability Conditions for Therapeutic Drug-Disease Relations

A text dataset for biomedical information extraction, developed for the ACL 2026 Findings paper 'Applicability Condition Extraction for Therapeutic Drug-Disease Relations'. The dataset is authored by B1tta and was last updated on June 18, 2026. It focuses on identifying context-specific conditions under which a drug is therapeutically effective for a disease.

Text, Drug Disease Relations, Text Extraction, Therapeutic Drugs, Healthcare, Clinical Decision Support, Applicability Conditions, Biomedical NLP, +1
