DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

42,093 datasets

Chinese HIV/AIDS Media Discourse Analysis from 2010 to 2024

A corpus of news articles published between 2010 and 2024, analyzed for media discourse on HIV/AIDS in China. The dataset, created by Yuhang Li and last updated in 2026, uses topic modeling and collocation analysis to identify thematic communities, terminology for people living with HIV, and conceptual metaphors. The analysis finds a discursive shift towards political narratives and the persistence of stigmatizing language.

Tabular, 🇨🇳 China, Excel, Media Analysis, Hiv Aids Discourse, Healthcare, Large Scale, Natural Language Processing, Text Corpus, Public Health, +1

HIV/AIDS Metaphors and Terminology in Chinese News Articles, 2010-2024

A corpus of Chinese news articles published between 2010 and 2024 analyzed for HIV/AIDS discourse. The dataset, created by Yuhang Li and shared under CC-BY-4.0, contains extracted thematic networks, 19 categories of terminology for people living with HIV, and 12 categories of HIV/AIDS metaphors.

Tabular, 🇨🇳 China, Excel, HIV-AIDS, Healthcare, Metaphor Analysis, Media Discourse, Large Scale, Natural Language Processing, Text Corpus, +1

PLHIV Terminology and Metaphors in Chinese Media Discourse, 2010-2024

Yuhang Li's dataset contains terminology and metaphor usage for people living with HIV (PLHIV) extracted from a large-scale corpus of Chinese news articles published between 2010 and 2024. The analysis identifies 19 categories of PLHIV terminology, 12 categories of HIV/AIDS metaphors, and 48 distinct topics across five thematic communities. The dataset is stored in an XLS file and was last updated on 2026-05-13.

Tabular, 🇨🇳 China, Excel, Media Analysis, Hiv Aids Discourse, Healthcare, Large Scale, Natural Language Processing, Terminology, +1

Ablation Study Results for Federated Multimodal Fusion Algorithms

5.5 KB of ablation study results from a federated learning framework for multimodal data fusion. The dataset likely contains experimental metrics comparing a novel tensor-based method against existing approaches on benchmarks like TREC2017 and CMU-MOSI. It was authored by Li Wan and published on figshare under a CC-BY-4.0 license in May 2026.

Tabular, Audio, Excel, Tensor Decomposition, Privacy Aware, Sentiment Analysis, Benchmark, Multimodal Fusion, Natural Language Processing, Federated Learning, +1

CMU-MOSI: Multimodal Sentiment Benchmark Performance Results

Experimental results comparing multimodal fusion methods on the CMU-MOSI sentiment benchmark. The dataset likely contains performance metrics from a federated learning framework that uses tensor decomposition for privacy-aware training. It was authored by Li Wan and uploaded to figshare on 2026-05-06.

Audio, Multimodal, Excel, Tensor Decomposition, Benchmark Performance, Social Media, Benchmark, Multimodal Sentiment Analysis, Natural Language Processing, Federated Learning, +1

MAP Values: Federated Learning Algorithm Performance on Multimodal Benchmarks

Li Wan published a federated learning framework for multimodal data fusion on figshare in May 2026. The dataset likely contains algorithm performance metrics, specifically Mean Average Precision (MAP) values, from experiments on the TREC2017 Precision Medicine Track and CMU-MOSI sentiment benchmarks. The file is 5.5 KB in size.

Audio, Multimodal, Excel, Tensor Decomposition, Social Media, Sentiment Analysis, Benchmark, Multimodal Fusion, Natural Language Processing, Federated Learning, +1

Interface Identification Constrained by Local-to-Nonlocal Coupling

A dataset supporting research into numerical methods for nonlocal physical models. It contains results from an energy-based Local-to-Nonlocal coupling used as a constraint for an interface identification problem. The dataset was authored by Matthias Schuster and last updated on May 14, 2026.

Tabular, Numerical Methods, Computational Mathematics, Nonlocal Models, Interface Identification, +1

Hobart Interim Planning Scheme 2015: Hazard and Development Overlays

City of Hobart's 2015 Interim Planning Scheme defines spatial overlays for hazards like landslides, coastal inundation, and climate change. These geospatial layers provide a general indication of regulated areas for land development and infrastructure projects. The data is intended for preliminary planning, with site-specific investigation recommended for final decisions.

Geospatial, ZIP, CSV, Landslide, Engineering, Infrastructure Planning, Climate Change, Hazard, Development, Land Use, Planning Overlays, Coastal, Infrastructure, Planning, Chips, Environment, Hazard Zoning, Inundation, Interim Planning Scheme, Overlays, City Of Hobart, +1

Math Curated Dataset: 50,944 Generated Math Problem Responses

50,944 records of generated responses to math and word-problem prompts. The dataset was prepared by User01110 from a local JSON file and published on Hugging Face in Parquet format for viewer compatibility. It was last updated on June 13, 2026.

Tabular, Generated Text, Language Models, Mathematics, Question Answering, Synthetic, +1

Filtering and Smoothing in State-Space Models with Multiple Regimes

A paper and associated materials presenting improved Bayesian filtering techniques and a novel smoother for regime-switching state-space models. The work assesses performance using a New Keynesian DSGE model and three filters, with simulation results showing speed and accuracy improvements. The author is Nigar Hashimzade, and the materials were last updated in April 2026.

Tabular, ZIP, State Space Models, Bayesian Filtering, Macroeconomic Analysis, +1

German Adult Survey on Weight Management Awareness and Healthcare Utilisation

Survey data from 2,065 adults with overweight or obesity in Germany, published on 2026-05-05 by the figshare account "figshare admin karger". The data likely contains responses on awareness, use, interest, and barriers regarding primary care consultations, behavioural programmes, and pharmacotherapy for weight management. The dataset is a 4.1 MB PDF file licensed under CC-BY-4.0.

Tabular, 🇩🇪 Germany, Healthcare Survey, Weight Management, Healthcare, Obesity Care, Public Health, +1

Intermolecular Interaction Energies for Linear Alkane Dimers (N=1-18)

A 2026 theoretical study by Chenhui Wang provides high-accuracy interaction energies for linear alkane dimers (C_nH_{2n+2}, n = 1 to 18). The dataset includes BSSE-corrected results ranging from -2.2 kJ/mol (n = 1) to -62.6 kJ/mol (n = 18), with relative errors below 5% against benchmark calculations. It also contains a thermodynamic analysis indicating spontaneous dimerization for n ≥ 8 at 100 K.

Tabular, Intermolecular Interactions, Thermodynamics, Van Der Waals, Benchmark, Computational Chemistry, Alkane Dimers, +1

HAMSR EPOCH Atmospheric Profiles from NASA Global Hawk

Twenty-five spectral channels from the High Altitude MMIC Sounding Radiometer (HAMSR) captured atmospheric data during the NASA EPOCH project in August 2017. This dataset provides measurements to infer three-dimensional profiles of temperature, water vapor, and cloud liquid water, even in cloudy conditions. It was collected from the NASA Global Hawk aircraft as part of a training and research mission focused on tropical cyclogenesis in the Eastern Pacific.

Time Series, Geospatial, Earth Science Clouds Atmosphere Cloud Microphysics, Nasa Airborne, Atmospheric Science, Microwave Radiometry, Earth Science Microwave Spectral Engineering Brigh, Earth Science Atmospheric Water Vapor Atmosphere W, Hurricane Research, Earth Science Atmospheric Temperature Atmosphere S, +1

HS3 Global Hawk Navigation Data for Hurricane and Storm Research

Navigation and housekeeping data from NASA's Global Hawk aircraft during the Hurricane and Severe Storm Sentinel campaign. The dataset contains real-time 1 Hz UDP packets broadcast in IWG1 format, capturing flight and atmospheric measurements to study tropical storm formation and the Saharan Air Layer. It is produced by the National Aeronautics and Space Administration, with metadata last updated in March 2026.

Text, Time Series, Tropical Cyclones, Atmospheric Science, Airborne Instrumentation, Flight Navigation, Large Scale, Hurricane Research, Synthetic, +1

DNABERT_vectors: Plasmid and Chromosome Feature Embeddings

DNABERT embeddings calculated from plasmids and chromosomes. Maho Tokuda created this 2.1 MB dataset for a RandomForest model predicting plasmid destinations. The dataset was last updated in June 2026.

Tabular, CSV, Machine Learning, Bioinformatics, Plasmid, Chromosome, +1

Parent input and language outcomes in CIs (Luo et al., 2026)

A longitudinal study dataset from 25 Mandarin-speaking children who received cochlear implants before 30 months of age. The data includes parent lexical diversity (NDW) and grammatical complexity (MLU) measures at 1 and 2 years post-implant, correlated with children's standardized language test scores at 3 years post-implant. The dataset was published by Luo et al. in 2026 and is hosted on figshare.

Tabular, Audio, Cochlear Implants, Parent Input, Language acquisition, Longitudinal Study, Mandarin Speaking, +1

Alkaline Mineralization of Aryl Trifluoromethyl Groups Near Benzylic N-Heterocycles

A study by Serena L. DiLiberti reports the defluorination of trifluoromethyl arenes bearing an ortho N-heterocycle upon reaction with potassium tert-butoxide in THF at ambient temperature. The dataset likely contains experimental results from 14 demonstrated examples of this reaction, with yields up to 85%. The findings, shared on figshare in May 2026, are intended to inform synthetic route design for targets containing these functional groups.

Tabular, ZIP, Organic Chemistry, Synthetic Chemistry, Chemical Reactions, Synthetic, +1

NAMMA Lightning Detection Network Data from 2006 Campaign

Thirteen ground stations across Europe, Africa, and Brazil collected global lightning activity data from August 1 to October 1, 2006. This dataset was generated for the NASA African Monsoon Multidisciplinary Analyses campaign to study African Easterly Waves and Mesoscale Convective Systems. The network provides high temporal resolution of 1 millisecond and spatial accuracy ranging from 10-20 km within the network to over 50 km outside its periphery.

Time Series, Geospatial, Text, Lightning Detection, African Monsoon, Atmospheric Electricity, Geospatial Time Series, Synthetic, +1

SPTT & SPTEdu-seq: Spatial Transcriptomics Data for Mouse and Human Tissues

Mouse and human spatial transcriptomics data generated using SPTT and SPTEdu-seq techniques. The dataset includes multiple mouse embryo, kidney, and brain samples, as well as human ccRCC frozen sections, with digital expression data in .mtx format and metadata. The dataset is 1.6 GB in size, authored by Shuang Zhang, and was last updated on May 14, 2026.

Tabular, ZIP, Gene Expression, Mouse Brain, Single Cell, Spatial Transcriptomics, Spatial Biology, Synthetic, +1

European Middle Pleistocene Hippopotamus Morphometric and Dietary Data

Roberta Martino's dataset from figshare, last updated May 2026, provides morphometric and microwear data on European hippopotamus fossils. The 177.4 KB XLSX file includes data from a review of Pleistocene specimens from Central and Western Europe. It focuses on mandibular and cranial features to assess phenotypic diversity and dietary shifts in Hippopotamus antiquus populations.

Tabular, 🇪🇺 Europe, Excel, Fossil Data, Natural Language Processing, Zoology, Morphometrics, Paleontology, +1
