DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

42,058 datasets

WHO 10-Group Classification: C-Section Patterns and Perceptions in Pakistan

Maria Atif's mixed-method study provides evidence on patterns and drivers of Cesarean sections in Pakistan. The dataset includes quantitative data from 605 women who underwent C-sections in public, private, or semi-private facilities, along with qualitative insights from stakeholder perceptions. The data was last updated on 2026-05-14 and is shared under a CC-BY-4.0 license.

Tabular, Excel, C Section, Healthcare Survey, Maternal Health, Healthcare, Pakistan, Who Classification

Raw ERG Data and R-Code for Photoreceptor-Specific Splicing Study

A study by Bohye Jeong investigates the role of Musashi protein paralogs MSI1 and MSI2 in photoreceptor-specific alternative splicing. The dataset includes raw ERG data and R-code supporting the analysis of splicing in Cc2d2a, Cep290, Prom1, and Ttc8 genes across combined Msi1 and Msi2 knockout models. The data was last updated on May 14, 2026, and is shared under a CC-BY-4.0 license.

Tabular, ZIP, Gene Expression, Photoreceptor Cells, Alternative splicing, Knockout Models, Synthetic, Musashi Proteins

Immunofluorescence Data for Musashi Gene Knockouts in Photoreceptor Cells

393.8 KB of immunofluorescence data from Bohye Jeong, last updated May 14, 2026. The dataset supports a study on the role of Musashi1 and Musashi2 proteins in regulating photoreceptor-specific splicing. It contains data from combined Msi1 and Msi2 knockout models used to analyze exon inclusion in genes Cc2d2a, Cep290, Prom1, and Ttc8.

Image, ZIP, Gene Expression, Photoreceptor Cells, Splicing Analysis, Immunofluorescence, Knockout Models, Synthetic

Mann-Whitney U Test Results for Topological Analysis of Abstract Paintings

5.5 KB of statistical test results from a study applying persistent homology to analyze abstract paintings. The dataset, authored by Emil Dmitruk and shared on figshare under CC-BY-4.0, compares two sets of images based on viewer eye tracking, brain activity, and subjective experience. It was last updated on May 14, 2026.

Tabular, Excel, Computer Vision, Persistent Homology, Eye Tracking, Computational Topology, Abstract Art, Art Analysis

Global 30-Year Mean Monthly Climatology, 1930-1960

Monthly averages of mean temperature, temperature range, precipitation, rain days, and sunshine hours are provided for the terrestrial surface of the globe. The data is gridded at a 0.5-degree longitude/latitude resolution and represents a 30-year climatology from 1930 to 1960. It was generated from a large database using a partial thin-plate splining algorithm.

Tabular, Geospatial, ZIP, Text, Gridded Climate, Global Gridded, Monthly Climatology, Temperature, Global Climatology, Monthly Averages, Precipitation, Climate Data, Synthetic

SleepDepScore Evaluation Metrics for Sleep and Depression NLP Models

Evaluation metrics for the SleepDepNet model, a transformer-based multi-task learning framework for analyzing sleep quality and depressive sentiment from Reddit text. The dataset, authored by Akshi Kumar and last updated on 2026-05-07, contains performance scores including F1-scores of 0.89 for sleep quality classification and 0.86 for depressive sentiment analysis. It is stored in an XLS file with a size of 5.5 KB.

Tabular, Excel, Mental Health, Sleep Quality, Depression Sentiment, Nlp Evaluation, Benchmark, Healthcare, Natural Language Processing, Reddit Text, Synthetic

SleepDepNet: Reddit Text Performance Comparison for Sleep and Depression Analysis

A 5.5 KB dataset on figshare, authored by Akshi Kumar and last updated May 7, 2026, under a CC-BY-4.0 license. It contains performance comparison results for the SleepDepNet multi-task learning framework, which models sleep quality and depressive sentiment from user-generated Reddit text. The dataset likely includes experimental results such as F1-scores of 0.89 and 0.86 for the model's classification tasks.

Tabular, Excel, Mental Health, Sleep Quality, Depression Sentiment, Benchmark, Healthcare, Natural Language Processing, Reddit Text, Synthetic

SleepDepNet: Performance Metrics for Ablation Study on Sleep and Depression

5.5 KB of performance metrics for the SleepDepNet ablation study. The dataset, authored by Akshi Kumar and last updated on 2026-05-07, contains experimental results from a transformer-based multi-task learning framework analyzing Reddit text for sleep quality and depressive sentiment. It includes F1-scores of 0.89 for sleep quality classification and 0.86 for depressive sentiment analysis.

Tabular, Excel, Mental Health, Sleep Quality, Multi Task Learning, Depression Sentiment, Benchmark, Healthcare, Natural Language Processing, Reddit Text, Synthetic

SleepDepNet Evaluation Metrics for Sleep and Depression Classification from Reddit

Akshi Kumar's 2026 dataset contains evaluation metrics for the SleepDepNet model, a multi-task learning framework analyzing user-generated text. The dataset, stored in an XLS file of 5.5 KB, includes performance scores for classifying sleep quality and depressive sentiment from Reddit posts. Experimental results reported include an F1-score of 0.89 for sleep quality and 0.86 for depressive sentiment analysis.

Tabular, Excel, Mental Health, Sleep Quality, Depression Sentiment, Nlp Evaluation, Benchmark, Healthcare, Natural Language Processing, Reddit Text, Synthetic

SleepDepNet: Reddit Text for Sleep and Depression Analysis

A dataset supporting the SleepDepNet multi-task learning framework, introduced by Akshi Kumar and last updated on 2026-05-07. The data consists of user-generated text collected from Reddit communities related to sleep and mental health. It is used to model the relationship between sleep quality and depressive sentiment.

Tabular, Excel, Mental Health, Sleep Quality, Benchmark, Healthcare, Natural Language Processing, Reddit Text, Synthetic, Depression

Characteristics of 1965 Chinese University Students During COVID-19 Campus Lockdown

A total of 1,965 Chinese college students participated in a cross-sectional study during COVID-19 campus lockdowns. The dataset contains survey results exploring associations between psychological distress, lifestyle, career planning, and health-related quality of life. Data was collected via an online questionnaire platform using snowball sampling and analyzed by Baochen Su.

Tabular, Excel, Student Well Being, Health Related Quality Of Life, Cross Sectional, Healthcare, Covid 19, Psychological Distress

Community Perspectives on Psychosis in Malawi with 76 Participant Interviews

76 participants from Malawi's Salima and Chiradzulu districts were interviewed between October and December 2023. This qualitative dataset contains summarized themes and representative quotes from 16 in-depth interviews and six focus group discussions with traditional healers, religious leaders, caregivers, and persons with lived experience. The data explores community perspectives, treatment-seeking practices, and pathways for psychosis management.

Text, Excel, Mental Health, Psychosis, Healthcare, Finance, Community Health, Malawi, Qualitative Research

Collaborating Authors: Publications and Patents for Four Chinese Cities, 2016-2025

Hua Song compiled over 39,000 Web of Science publications and nearly 10,000 patent records from 2016 to 2025. The data covers four Chinese cities—Wuhan, Chengdu, Hangzhou, and Tianjin—and four high-tech domains: AI, fiber-optic communication, intelligent connected vehicles, and storage chips. The dataset was last updated on 2026-05-14.

Tabular, Excel, Regional Innovation, China Tech, Llm Assisted Analysis, Patent Analysis, Large Scale, Bibliometrics, Synthetic

High-Tech Industry Innovation Metrics for Four Chinese Cities, 2016-2025

China's high-tech innovation landscape is analyzed through over 39,000 Web of Science publications and nearly 10,000 patent records from 2016 to 2025. The data covers Wuhan, Chengdu, Hangzhou, and Tianjin across AI, fiber-optic communication, intelligent vehicles, and storage chips. Author Hua Song compiled this dataset, last updated in May 2026, using bibliometric analysis and LLM-assisted semantic interpretation.

Tabular, Excel, Regional Innovation Systems, Large Scale, Bibliometric Analysis, High Tech Industries, China Cities, Synthetic, Patent Publication Data

Japan Energy and Mining Indicators from the World Bank

World Bank data on energy production, use, dependency, and efficiency for Japan, compiled from the International Energy Agency and the Carbon Dioxide Information Analysis Center. The dataset addresses the sustainability of global energy trends amidst economic growth and industrialization. It was last updated on 2026-04-28.

Tabular, CSV, Sustainability, World Bank, Finance, Mining, Economic Growth, Energy

Our Big Conversation: Resident Feedback on Pandemic Life Changes and Recovery Needs, 2020

An anonymized survey dataset from the 'Our Big Conversation' consultation run in 2020. It contains raw responses from residents on how life changed due to the COVID-19 pandemic and what is needed for recovery. The data was collected via an online survey and a paper survey in the June 2020 edition of 'Our City' and is published by the Government Digital Service under the OGL-UK-3.0 license.

Text, Tabular, Survey, Covid 19, Recovery, Community Feedback, Public Opinion

India Energy and Mining Indicators from World Bank Data

World Bank data compiled from the International Energy Agency and the Carbon Dioxide Information Analysis Center. It contains indicators on energy production, use, dependency, and efficiency for India, reflecting trends in the world economy and industrialization. The dataset was last updated on 2026-04-28.

Tabular, CSV, Energy Production, Energy Use, World Bank, Energy efficiency, Finance, Mining

Eromanga Basin Hydrogeological Inventory for the Great Artesian Basin

The Australian Ocean Data Network provides an inventory of descriptive attributes for the Eromanga Basin groundwater system. The dataset covers over 1,250,000 square kilometres in central and eastern Australia and includes themes such as location, demographics, geology, hydrogeology, and land use. It was last updated on 2026-05-05.

Geospatial, 🇦🇺 Australia, Geology, Groundwater, Hydrogeology

Second-Generation RARα Antagonist for Male Contraception: Compound 23 Data

A 2026 dataset from figshare by Rui Shi details the development of a selective RARα antagonist for male contraception. It includes information on the discovery of compound 23, a highly potent and selective inhibitor with an IC50 of 0.051 nM and >1650-fold selectivity over RARβ. The data covers the compound's ADMET properties, oral bioavailability, and contraceptive efficacy in mice.

Tabular, Male Contraception, Pharmacology, Drug Discovery, Retinoic Acid Receptor, Sar Studies

RARα Antagonist Compound 23: Structure and Activity Data for Male Contraception

300.6 KB of data on benzopyran-, benzofuran-, and benzothiophene-derived RARα inhibitors for male contraception. The dataset includes SAR studies leading to compound 23, a highly potent and selective antagonist with an IC50 of 0.051 nM, published by Rui Shi on figshare in May 2026. Compound 23 is described as orally bioavailable and effective in reducing sperm counts in mice.

Tabular, Male Contraception, Pharmacology, Drug Discovery, Retinoic Acid Receptor, Sar Studies
