DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


NLP & Text Datasets | DataSalon


NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

43,995 datasets


Historical Landfills in Alberta with Environmental Risk Rankings

Alberta's historical landfill locations, digitized from three sources. The data originates from a 1982 survey by MacLaren Plansearch Lavalin, which ranked sites by potential environmental and human health risk. Subsequent evaluations by Associated Engineering in 1985 and digitization by Alberta Environment and Protected Areas contributed to this spatial dataset.

Geospatial, ZIP, XML, Historical Landfills, Benchmark, Healthcare, Land Use, Alberta, +1 more


SmolKalam: Arabic Conversational Supervised Fine-Tuning Dataset

SmolKalam is a quality-filtered Arabic supervised fine-tuning dataset built as an ensemble translation of HuggingFaceTB/smoltalk2. It covers multi-turn dialogue, reasoning traces, tool and function calling, and long-context examples. The dataset was produced by AdaMLLab and last updated on June 22, 2026.

Text, Reasoning Traces, Conversational AI, Arabic Language, Multiturn Dialogue, Supervised Fine Tuning, +1 more


Beyond Belief Change: LLM-Generated Counterargument Experiments on Political Attitudes

American Political Science Review Dataverse hosts replication data for a study on political persuasion and belief relevance. The research involved experiments with two large online convenience samples, using large language models to generate counterarguments targeting specific beliefs. Yamil Velez authored the dataset, which was last updated on June 18, 2026.

Tabular, Belief Attitude Change, LLM Experiments, Public Opinion, Political Persuasion, Synthetic, +1 more


BCAI Finance Kor Embedding Triplet: Korean Financial Text Triplets

45,394 triplets of Korean financial text for fine-tuning sentence-embedding models, with graded relevance labels. The dataset was created by BCCard/BCAI using hybrid hard-negative mining that combines FAISS top-K retrieval with a Claude Sonnet LLM judge. It was last updated on June 11, 2026.

Text, Korean, Machine Learning, Triplet, Embedding, Finance, Natural Language Processing, Text Embeddings, +1 more


Targeting a Pleckstrin Homology Domain with a Lysine-Reactive Covalent Binder

27 crystal structures from a structure-binding relationship study for Bruton’s Tyrosine Kinase (BTK) inhibition. The dataset, authored by Rebekah M. West and last updated on 2026-05-14, explores targeting the PH domain with a covalent fragment that modifies a lysine in the PIP3 binding site.

Tabular, CSV, PH Domain, Benchmark, Protein Inhibition, Structural Biology, BTK Kinase, Covalent Binding, +1 more


NarraDolma: Narrative Feature Vectors for the 3-Trillion-Token Dolma Corpus

NarraDolma provides a large-scale narrative characterization of the Dolma pretraining corpus. It contains approximately 3 million passages drawn from about 785,000 unique documents across all 12 Dolma sub-corpora, each labeled with a fine-grained narrative feature vector. The dataset was created by teagrjohnson and is intended as a resource for studying how narrative qualities are distributed in web-scale data.

Text, LLM Pretraining, Dolma, Large Scale, Natural Language Processing, Narrative Analysis, Text Corpus, +1 more


Porous Media Permeability Models with Grain Roundness and Porosity Data

Jiabin Dong published a collection of datasets on figshare in April 2026 for pore-scale numerical studies. The 706.3 KB collection includes files analyzing the synergistic control of grain roundness and volume on the permeability of fractal porous sandstone. It contains datasets for constructing hierarchical Voronoi porous media, comparing theoretical and actual porosity, and relating roundness to permeability via Lattice Boltzmann Method simulations.

Tabular, Text, Excel, Porous Media, Digital Reconstruction, Permeability, Lattice Boltzmann, Permeability Modeling, Numerical Simulation, Synthetic, Grain Roundness, +1 more


Onshore Energy Security Program: Geothermal Energy Project Scope and Progress

A 2006 initiative funded with $58.9 million over five years for Geoscience Australia to acquire pre-competitive geoscience data. The program, delivered in collaboration with States and Territories, aims to attract investment in onshore energy exploration, including geothermal, petroleum, uranium, and thorium. The description outlines the program's structure and the specific Geothermal Energy Project's focus on mapping crustal temperature distribution.

Text, 🇦🇺 Australia, Heat Flow, Large Scale, Geothermal energy, Geoscience, +1 more


Media Coverage Analysis of Female Gubernatorial Candidates in Mexico, 2021

A database supporting academic articles analyzing news coverage of female gubernatorial candidates during Mexico's 2021 election campaigns. The dataset is 630.0 KB in size, stored in an XLSX file, and was created by Edrei Álvarez-Monsiváis. It was last updated on 2026-05-15.

Tabular, Excel, Media Coverage, Mexico Elections, News Analysis, Gender Politics, Political Communication, +1 more


Factorial Linear Mixed Model Test Results for Plant Traits

A 5.5 KB Excel file containing statistical test results from a factorial general linear mixed model analysis. The dataset reports F ratios and p-values for fixed and random effects on four plant traits: total height, flower number, leaf width, and pistil length. It was authored by Arezoo Fani and last updated on 2026-05-15.

Tabular, Excel, Plant Biology, Linear Mixed Model, +1 more


Plant Trait Statistical Test Results from a Factorial Linear Mixed Model

Statistical test results from a factorial general linear mixed model fitted to four plant traits: total height, flower number, leaf width, and pistil length. The dataset reports F ratios and p-values for fixed effects (parental treatment, offspring treatment, and their interaction) and random effects (maternal plant and block). The author is Arezoo Fani, and the data was last updated on May 15, 2026.

Tabular, Excel, Plant Biology, Linear Mixed Model, +1 more


Drug-ACE: Applicability Conditions for Therapeutic Drug-Disease Relations

A text dataset for biomedical information extraction, developed for the ACL 2026 Findings paper 'Applicability Condition Extraction for Therapeutic Drug-Disease Relations'. The dataset is authored by B1tta and was last updated on June 18, 2026. It focuses on identifying context-specific conditions under which a drug is therapeutically effective for a disease.

Text, Drug Disease Relations, Text Extraction, Therapeutic Drugs, Healthcare, Clinical Decision Support, Applicability Conditions, Biomedical NLP, +1 more


Synthetic Cell Cluster Ground-Truth Parameters for Rosetta-Routine

Manually defined parameters serve as the ground-truth reference for generating synthetic cell-like clusters. The 5.5 KB XLS file contains a priori values controlling cluster shape, spread, orientation, and event number. Authored by Bradley Mason and last updated in May 2026, this dataset supports replication and accuracy assessment for the Rosetta-Routine modelling pipeline.

Tabular, Excel, Ground Truth, Rosetta Routine, Synthetic Data, Cell Clusters, Synthetic, +1 more


Rosetta-Routine: Mapping of Statistical Measures to Cluster Generator Arguments

A 5.5 KB Excel file maps traditional descriptive statistical measures to conversion methods used by the Rosetta-Routine modeling algorithm. The mapping is intended to acquire information from unknown data and define corresponding cluster generator argument variables. Author Bradley Mason last updated the file on May 29, 2026, and it is shared under a CC-BY-4.0 license.

Tabular, Excel, Descriptive Statistics, Clustering Algorithm, Rosetta Routine, Statistical Measures, Data Conversion, +1 more


SI-1 Cluster Data: Real and Synthetic Event-Level Measurements for Population Modelling

A 5.1 MB Excel file containing datasets used for figure generation and quantitative analyses in a manuscript. The data includes real and synthetic event-level measurements intended for population modelling. It was authored by Bradley Mason and last updated on 2026-05-29.

Tabular, Excel, Population Modelling, Cluster Analysis, Event Level Measurements, Synthetic Data, Synthetic, +1 more


Infection Rate Estimates: Predictive Skill of PPTs Scored with CRPS

A 5.5 KB Excel file uploaded to figshare by Wyatt H. Bridgman on May 29, 2026. It contains data on the predictive skill of Probabilistic Predictive Trajectories (PPTs) generated using different infection-rate estimation procedures. The PPTs are scored using the Continuous Ranked Probability Score (CRPS) and have units of case counts.

Tabular, Excel, Predictive Skill, Epidemiology, Case Counts, Spatial Model, Infection Rate, Synthetic, +1 more


Waste Data Interrogator: UK Regulated Facility Returns

Around 6,000 regulated waste management facilities in the UK report annual data on waste quantities and types received and sent on from site. This data, collected since 2006 by the Environment Agency, is used for compliance monitoring and has historically supported planning by the EC, DEFRA, and local authorities. It is published in multiple formats including an MS Access interrogator, Excel extracts, and regional summary tables.

Tabular, Time Series, ZIP, Excel, Regulated Facilities, Environmental monitoring, Waste Management, Waste Data, Waste Data Interrogator, +1 more


Waste Data Interrogator 2017: UK Facility Waste Quantities

The Waste Data Interrogator 2017 dataset contains annual waste quantity and type data reported by regulated waste management facilities in the UK. It includes data from around 6,000 sites, collected by the Environment Agency for compliance monitoring and planning. The data is provided in multiple formats including an MS Access interrogator and Excel extracts.

Tabular, ZIP, Excel, Facility Operations, Waste Management, UK Government Data, +1 more


DBCata: Adsorption Structures and Model Checkpoints for Catalyst Screening

4.3 GB of cleaned adsorption structures from CatHub data, used for training the DBCata model. The dataset includes model checkpoints, fine-tuning scripts, and results for out-of-distribution testing. It was authored by Songze Huo and last updated on May 25, 2026.

Tabular, Multimodal, CSV, JSON, Machine Learning, Adsorption Structures, Catalyst Screening, Computational Chemistry, Materials Science, +1 more


ENERGY STAR Certified Residential Refrigerators with Efficiency Metrics

ENERGY STAR Certified Residential Refrigerators meet specific program requirements effective from September 15, 2014 or August 5, 2021. The dataset, sourced from data.energystar.gov, includes model specifications and efficiency metrics such as Annual Energy Use and Percent Less Energy Use than US Federal Standard. It was last updated on April 3, 2026.

Tabular, CSV, XML, JSON, Consumer Refrigeration, Residential Refrigerators, Appliances, Energy efficiency, Refrigerators, Certification, Consumer Products, +1 more

