DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


NLP & Text Datasets | DataSalon


NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,799 datasets

Supplementary Data on Azoxystrobin's Molecular Mechanism in Oral Leukoplakia

A 1.2 MB PDF file authored by Wenjing Li, last updated on 2026-04-20. The file contains supplementary data for a study investigating how the fungicide Azoxystrobin binds to specific sites on the Peroxiredoxin 1 protein to induce mitochondrial dysfunction and apoptosis in oral leukoplakia cell lines.

Tags: Text, Oral Cancer, Apoptosis, Drug Mechanism, Mitochondrial Function, Molecular Biology, +1 more

Outlet-Level Racial Bias Analysis in South African COVID-19 News

27,140 COVID-19 vaccination news articles from 39 South African outlets form the corpus for this racial bias study. Researcher Nnaemeka Ohamadike trained an ensemble of Word2Vec models to embed each outlet's language and measure its association with racial stereotype vocabularies. The dataset was published on April 5, 2026.

Tags: Text, Media Bias, News Analysis, South Africa, Racial Bias, Natural Language Processing, Word Embeddings, +1 more

Local Code Arena Starcoder2 15B: MBPP Benchmark Telemetry

ShahzebKhoso hosts raw evaluation metrics, execution telemetry logs, and structural syntax outputs from running the Mostly Basic Python Problems (MBPP) benchmark against the StarCoder2 15B base model. The dataset captures telemetry from conversational evaluation loops to establish a baseline for unaligned foundational weights. It was last updated on May 28, 2026.

Tags: Tabular, Benchmark Evaluation, LLM Telemetry, Benchmark, Code Generation, Python Problems, +1 more

Fallacy: Logical Fallacy Detection Dataset with 138k+ Examples

A dataset for detecting 14 types of logical fallacies in English text, created by kuwrom. It contains 138,574 rows for multi-class classification and 25,068 instruction examples for fine-tuning. The dataset was last updated on June 3, 2026.

Tags: Text, Text Classification, Logical Fallacy, Natural Language Processing, +1 more

Geochemical and Mineralogical Data for Kamativi Lithium Pegmatites, Zimbabwe 2018

October 2018 field observations and laboratory analyses for rock samples from the Kamativi area of Zimbabwe. The dataset includes whole-rock geochemical data from ICP-MS and mineralogical data from XRD and SEM-EDS, collected by the British Geological Survey. Data were gathered to research the internal evolution and crystallisation of lithium pegmatites.

Tags: Tabular, Geology, Mineralogy, Zimbabwe, Lithium Pegmatites, Geochemistry, +1 more

Stratum-FFHQ: 70,000 Synthetic Human Faces with Multi-Layer Annotations

70,000 synthetic human face images generated by the stratum-hq tool. The dataset includes multiple annotation layers such as captions, depth maps, normals, pose, segmentation, and embeddings from models like DINOv3 and T5. It was created by author 'timlawrenz' and last updated on the platform in May 2026.

Tags: Image, Multimodal, Computer Vision, Synthetic Images, Synthetic, +1 more

Montreal Agglomeration Heritage and Landscape Planning Data

This dataset provides land use and development planning information for the Montreal agglomeration, focusing on heritage and landscape features. It includes mapped data for built or archaeological heritage, emblematic landscapes, and views of interest to guide sustainable urban development decisions. The data originates from section 2.3 of the official Land Use and Development Plan.

Simplified Hydro-Generator Shaft Model with Dual Rotors and Excitation Analysis

A 493.7 KB Excel dataset containing a simplified model of a hydro-generator shaft system with two rotors. The model, created by Tengjiao Guo, incorporates electromagnetic, mechanical, and flow excitation. It uses a 'shape guidance' strategy based on test signal libraries to analyze axis trajectories and correlate them with frequency components and excitation sources.

Tags: Tabular, Excel, Mechanical Modeling, Simulation Data, Hydro Generator, Rotor Dynamics, Excitation Analysis, +1 more

Forster Pacific Palms Cape Hawke Seabed Backscatter and Bathymetry, 2019-2022

Seabed data from New South Wales, Australia, collected by the NSW Department of Planning and Environment between March 2019 and August 2022. The dataset contains 32-bit floating-point GeoTIFF files of bathymetry and backscatter at 5-meter resolution, derived from multibeam sonar surveys. It was created to provide a baseline and to map seabed type distribution as part of the SeabedNSW program.

Tags: Image, Geospatial, ZIP, Benchmark, Backscatter, Marine Survey, Coastal Mapping, Bathymetry, +1 more

Montreal Agglomeration Land Use and Transport Planning Data

Urban planning data from the Montreal agglomeration's Land Use and Development Plan outlines parameters for sustainable development decisions. The dataset includes thematic information on transport, compact neighborhoods, and economic development, accessible via an interactive map. Row and column counts are not specified.

Sentinel-2 Satellite Mosaics for Quebec 2018-2020

Government and Municipalities of Québec provides three annual satellite mosaics covering the entire territory of Quebec. The mosaics contain multispectral imagery from the Copernicus Sentinel-2 mission for 2018, 2019, and 2020, featuring blue, near infrared, and short wave infrared spectral bands.

Geochemical Data for 1.95 Ga S-Type Granites from the Helanshan Complex

Jie-Long Shen published geochemical and isotopic data for Paleoproterozoic S-type granites from the Helanshan Complex in the North China Craton. The dataset includes whole-rock geochemistry, zircon U-Pb dating results, and Nd isotopic data for samples with crystallization ages around 1.95 Ga. It was last updated on 2026-04-11 and is available under a CC-BY-4.0 license.

Tags: Tabular, Excel, North China Craton, Paleoproterozoic, Petrology, Granite Analysis, Geochemistry, +1 more

MAIA Surface Monitor: Global Particulate Matter Measurements

NASA's Atmospheric Science Data Center processes particulate matter measurements from a global in-situ surface monitoring network. The MAIA Surface Monitor Stage 0 files contain these processed PM data as an ancillary dataset. Columns likely include time-series measurements of PM concentrations from monitoring stations worldwide.

Tags: Tabular, Time Series, Environmental Science, Air Quality, Particulate Matter, NASA, Synthetic, +1 more

Geomorphic Features of the Lord Howe Island and Balls Pyramid Shelves

The Lord Howe Island and Balls Pyramid shelves are classified by geomorphic feature and shelf region. The dataset provides information on the size, extent, and type of features, including submerged fossil reefs, ridges, sandy basins, and paleochannels. It was created by visually interpreting and digitizing broad seafloor features in ArcGIS, extending prior work by Linklater et al. (2015).

Tags: Geospatial, 🇦🇺 Australia, Coral Reefs, Geomorphology, Marine Geology, +1 more

Regulation of physical activity and energy expenditure through Phf6 in the medial preopti

335.0 MB of source data and original figures supporting a neuroscience study on Phf6 gene function in the medial preoptic area. The dataset, authored by Jingjie Wang and shared under CC-BY-4.0, includes files for figures 1-7 and supplemental figures 1-9. It was last updated on May 20, 2026.

Tags: Tabular, Excel, Phf6, Energy Expenditure, Neuroscience, Physical Activity, Preoptic Area, +1 more

IUB Information Assets Registry: Public Information Inventory

An inventory of public information generated, obtained, acquired, transformed, and controlled by the Institución Universitaria de Barranquilla (IUB). The dataset includes columns for document series, format, language, description, and period. It was last updated on 2026-05-18 16:37:35 and is hosted on the Colombian open data portal www.datos.gov.co.

Tags: Tabular, CSV, XML, JSON, Government Data, Document Management, Public Information, Inventory, +1 more

Local Code Arena Starcoder2 7B: MBPP Benchmark Telemetry

Raw evaluation metrics, execution telemetry logs, and structural syntax outputs from running the Mostly Basic Python Problems (MBPP) benchmark against the StarCoder2 7B base model. The dataset documents behavioral dynamics of mid-tier foundational weights in automated conversational evaluation workflows. It was authored by ShahzebKhoso and last updated on May 28, 2026.

Tags: Tabular, Benchmark Evaluation, LLM Telemetry, Benchmark, Code Generation, Python Problems, +1 more

Kyrgyzstan Refugee Cash Assistance Survey Feedback 2023

22 records of refugee feedback collected by UNHCR in Kyrgyzstan in 2023. The dataset captures feedback on the quality, sufficiency, utilization, and effectiveness of cash-based assistance. UNHCR uses this Post Distribution Monitoring to improve the relevance and quality of support provided to Persons of Concern.

Tags: Tabular, Refugee Assistance, Survey, Cash Voucher Assistance CVA, Cash Voucher Assistance, Needs Assessment, Livelihoods, +1 more

Practitioner Perspectives on Unowned Cat Management in Cyprus, Greece, and Portugal

44 practitioners across Cyprus, Greece, and Portugal provide frontline insights into unowned cat population and welfare management. The qualitative analysis, authored by Jamie L. DeLeeuw and published in April 2026, examines systemic challenges like unreliable funding, fragmented support, and weak legal frameworks. Findings reveal shared issues of overpopulation and welfare harms, alongside country-specific variations in governance and implementation.

Tags: Text, Tabular, Healthcare, Animal Welfare, Cross National Analysis, Population Management, Qualitative Research, +1 more

AusBathyTopo: Northern Australia 30m Bathymetric Depth Model (2018)

Geoscience Australia Data compiled this 30-meter resolution Digital Elevation Model (DEM) of bathymetry for Northern Australia in 2018. The dataset covers a continental shelf over 400 km wide and approximately 1500 km long, including coral reefs, sand cays, and slope canyons. Source data includes multibeam surveys, airborne LiDAR, satellite-derived bathymetry, and an intertidal elevation model, all edited and standardized to WGS84/MSL datums.

Tags: Geospatial, ZIP, Digital Elevation Model, Marine Geology, Northern Australia, Bathymetry, +1 more
