DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


NLP & Text Datasets | DataSalon


NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

43,370 datasets


Spatial inventory of publications for priority species in relation to Offshore Renewable E

Australia's National Environmental Science Program (NESP) compiled a spatial index of environmental literature for 100 threatened and migratory marine species in relation to Offshore Renewable Energy (ORE) areas. The inventory records study locations, methodologies, and potential impacts of ORE infrastructure on species like birds, cetaceans, and turtles. Data was sourced from a systematic literature review and observation repositories including BirdLife Australia and the Atlas of Living Australia.

Geospatial, 🇦🇺 Australia, Environmental Literature, Offshore Renewable Energy, Marine Species, Spatial Inventory


Nemotron Personas Belgium: Multilingual Synthetic Personas Grounded in Real Distributions

Nemotron Personas Belgium uses a compound AI approach to generate multilingual Belgian personas. The personas are grounded in real-world distributions, suggesting they model demographic or behavioral patterns. Created by NVIDIA and last updated on June 17, 2026, this dataset is intended for AI training tasks.

Text, Multilingual, Multilingual Personas, AI Training, Demographics, Synthetic Data, Belgium


ASR-KCSC: 5.22 Hours of Korean Conversational Speech

ASR-KCSC is an open-source Korean conversational speech corpus containing 5.22 hours of transcribed audio. The data consists of 22 conversations between seven pairs of speakers recorded on mobile devices in indoor environments. MagicHub released the dataset on Hugging Face; its last recorded update was in June 2026.

Audio, Korean Language, Speech Corpus, Natural Language Processing, Conversational Speech, Automatic Speech Recognition


Chimera-XTRM: Synthetic Offensive Security Data for AI Training

Chimera-XTRM is a synthetic dataset engineered for fine-tuning Large Language Models in advanced Red-Team operations. The dataset was created by Umranz and was last updated on June 21, 2026. It is intended strictly for authorized security research and defensive training.

Text, Red Team, Offensive Security, AI Training, LLM Fine Tuning, Synthetic Data, Synthetic


Geomorphology of Victoria: A Hierarchical Landform Framework at Three Scales

A hierarchical framework of geomorphological spatial entities at three tiers, with Tier 1 containing 8 Divisions, Tier 2 containing 34 categories, and Tier 3 containing 95 categories. The dataset, created by the Department of Energy, Environment and Climate Action, provides a spatial system to assist planning, monitoring and reporting for natural resource management in Victoria and Australia. It was last updated on 2026-04-08.

Geospatial, Soil Erosion, Geomorphology, Spatial Hierarchy, Landform Classification, Natural resource management


VNP47MOD: VIIRS Fire Combustion Efficiency from Suomi NPP

85 variables provide fire detection and retrievals of Fire Radiative Power (FRP), fire Visible Energy Fraction (VEF), and Modified Combustion Efficiency (MCE). The NASA/NOAA Suomi NPP VIIRS FILDA-2 product is generated in 6-minute orbit segments at a 750-meter spatial resolution, designed to detect smaller and cooler fires using visible band observations at night. This dataset supports analysis of fire characteristics and combustion efficiency globally.

Time Series, Geospatial, Combustion Efficiency, Earth Science Surface Thermal Properties Land Surf, Satellite Imagery, VIIRS, Earth Science Ecological Dynamics Biosphere Fire E, Earth Science, Fire Detection


VNP17A2: Global Vegetation Productivity at 500m Resolution

Global satellite-derived data provides cumulative 8-day composites of Gross Primary Productivity (GPP) and Net Photosynthesis (PSN) at a 500-meter spatial resolution. The dataset, based on the radiation use efficiency concept, is designed as an input for models calculating terrestrial energy, carbon, and water cycle processes. It contains three primary variables for GPP and PSN alongside a quality control layer.

Time Series, Geospatial, Gross Primary Productivity, Net Photosynthesis, Satellite Remote Sensing, Computer Vision, Carbon cycle, Vegetation Biogeochemistry


Wetland Inventory Pilots for Alberta Using Satellite Imagery and LiDAR

This dataset contains wetland inventory data for four pilot study areas in Alberta, Canada, totaling approximately 39,045 km². The Government of Alberta, Ducks Unlimited Canada, and Alberta Biodiversity Monitoring Institute collaborated to develop this inventory using Earth Observation imagery and machine learning techniques. The dataset identifies wetland class and form according to the Alberta Wetland Classification System.

Geospatial, ZIP, XML, Machine Learning, Alberta, Wetland Mapping


Seasat Scatterometer: Global Monthly Ocean Wind Stress (1978)

Seasat-A Scatterometer (SASS) data provides monthly averaged ocean surface wind stress from July to October 1978. The data is gridded on a 2.5-degree global grid, with vector wind stress stored in dynes per square centimeter. It is derived from 96 days of SASS vector winds processed to remove directional ambiguities using a GSFC atmospheric model.

Tabular, Geospatial, Ocean Winds, Earth Science, Seasat, Wind Stress, Earth Science Ocean Winds Oceans Wind Stress Ocean


National Core Library Statistics for Canada, 1994-1999

133 variables across 2,050 cases capture key indicators of library services in Canada. The data were collected for the survey years 1994, 1995, 1996, and 1999 by the National Library of Canada in collaboration with library associations. The program was dissolved after the publication of the 1999 statistical report in 2002.

Tabular, 🇨🇦 Canada, Academic Libraries, Library Statistics, Public Services, Synthetic


5028 Block 4 (Northeast): Reduced Magnetic Survey Data for Narryer, 2024

415,090 line-kilometres of Total Magnetic Intensity data were acquired over the Narryer region in 2024. The dataset is processed with corrections for diurnal variation, geomagnetic reference fields, and levelling to highlight subsurface geology. It was published by Geoscience Australia Data and last updated in May 2026.

Geospatial, 🇦🇺 Australia, Geophysics, Mineral exploration, Magnetic Survey


5028 Block 3 (Northwest): Raw-Edited Magnetic Point Data from Narryer Survey

Total Magnetic Intensity (TMI) point-located data measures variations in the Earth's magnetic field. This line dataset from the Narryer survey in Western Australia was acquired in 2024 by the WA Government and consists of 415,090 line-kilometres of data. The raw edited data includes measurements such as raw TMI, compensated TMI, diurnal, fluxgate magnetometer, raw altimeter heights, and ellipsoidal GNSS heights.

Geospatial, 🇦🇺 Australia, Geological mapping, Geophysics, Mineral exploration, Magnetic Survey


VNP21A1D: Daily Global Land Surface Temperature from VIIRS

NASA/NOAA's VIIRS/NPP Land Surface Temperature/Emissivity Daily L3 Global 1km SIN Grid Day V002 dataset provides daily, gridded estimates of land surface temperature and emissivity. The product is compiled from daytime VIIRS swath data, resampled to a 1-kilometer sinusoidal grid, and uses an algorithm compatible with MODIS for continuity. It contains seven science datasets including LST, quality control, emissivity for three spectral bands, view zenith angle, and observation time.

Time Series, Geospatial, Earth Science Surface Radiative Properties Land Su, Surface Emissivity, Earth Science Surface Thermal Properties Land Surf, VIIRS, Land Surface Temperature, Satellite Remote Sensing, Computer Vision, Earth Science


Dark Septate Endophyte Fungi and Soil Nutrients for Ulmus Pumila in Inner Mongolia

A study of 200 dark septate endophytic (DSE) fungal strains isolated from Ulmus pumila L. roots across three sandy lands in eastern Inner Mongolia. The dataset includes fungal species composition, colonization rates, and key rhizosphere soil nutrient measurements. It was authored by Yunxia Ma and last updated in April 2026.

Tabular, Plant Microbe Interactions, Sandy Ecosystems, Inner Mongolia, Mycology, Soil Nutrients


VJ121A1D: Daily Global Land Surface Temperature and Emissivity at 1km

Global daily land surface temperature and emissivity data at a 1-kilometer resolution, derived from NOAA-20 VIIRS satellite observations. The dataset is produced by averaging multiple cloud-free, high-accuracy observations per grid cell, weighted by observation coverage, and is algorithmically compatible with NASA's MODIS products for continuity. It contains seven science datasets including temperature, quality control, emissivity for three spectral bands, view angle, and observation time.

Time Series, Geospatial, Earth Science Surface Radiative Properties Land Su, Earth Science Surface Thermal Properties Land Surf, VIIRS, Land Surface Temperature, Satellite Remote Sensing, Computer Vision, Earth Science, Emissivity


Colombian National Security Code Infractions by Municipality

The dataset "Comparendos aplicados por el Código Nacional de Seguridad y Convivencia Ciudadana" records infractions under Colombia's National Security and Citizen Coexistence Code. It is hosted on the datos.gov.co platform via Socrata and was last updated on 2026-05-18. It likely contains records of official orders issued by the National Police.

Tabular, CSV, XML, JSON, Law Enforcement, Colombia, Tabular Data, Public Safety


OpenAI Comic Strips: 3,000 Generated Images for Spatial Grounding Research

500 six-panel comic strips generated with OpenAI's gpt-image-1, totaling 3,000 images. Each strip is paired with structured metadata including art style, a recurring protagonist, and a caption for every panel. The dataset was created by baulab to study spatial grounding in vision-language models, specifically tracking attention across multi-panel images.

Image, Multimodal, Spatial Grounding, Vision Language Models, Computer Vision, Generated Images, Comic Strips, Synthetic


ENERGETIC: Errata Data and Scripts for U280 FPGA Power Measurements

Errata data and scripts correct a methodological error in the original ENERGETIC project report concerning power and energy consumption measurements of a U280 FPGA card. The dataset, authored by Michael Bane, was last updated on May 28, 2026. It is a small archive of 672.6 KB containing revised data and analysis scripts.

Text, Tabular, ZIP, Errata, Energy Consumption, Hardware Benchmarking, FPGA


SIHIPCE: Antarctic Sea-Ice Under-Ice Imagery and Auxiliary Data, 2018-2019

Three terabytes of high-resolution imagery and auxiliary data collected from Antarctic fast ice at Cape Evans during November and December 2018-2019. The dataset was acquired by the IMAS/AGP under-ice HI system and a custom ice core scanner for the 'On Thin Ice' grant, a collaboration between AGP and NZARI. It includes in-situ transects under natural light, ex-situ ice core scans, irradiance measurements, fluorometric samples, and media footage.

Image, Multimodal, Antarctic, Sea ice, Photogrammetry, Under Ice Imaging, Marine Biology


Port Fairy Wave Energy Bathymetry Survey from October 2015

A bathymetry survey acquired by Deakin University over two days in October 2015 (14/10/2015-15/10/2015). The survey was conducted onboard the Motor Vessel Yolla using a Kongsberg EM2040c sonar system and is managed by the Australian Ocean Data Network.

Geospatial, ZIP, Marine Survey, Ocean Energy, Coastal Geomorphology, Bathymetry
