DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

41,487 datasets

U.S. Supplier Direct Purchases by Industry and Business Size, Q2 2026

Statistics Canada provides a quarterly survey measuring the percentage of purchases made directly from U.S. suppliers by Canadian businesses. The data is broken down by NAICS industry classification, business employment size, type of business, activity, and majority ownership for the second quarter of 2026. It is available in XML, CSV, and HTML formats under the OGL-CA-2.0 license.

Tabular, CSV, XML, NAICS, Economic Survey, Supply Chain, Business Purchasing, +1 more

First-Trimester Biomarkers and Gestational Diabetes Risk in a Prospective Cohort

A prospective cohort study by Xue Wei, published on figshare, followed 231 singleton pregnant women to investigate early predictors of gestational diabetes mellitus (GDM). The dataset likely contains measurements of serum agouti signalling protein (ASIP), the triglyceride-glucose (TyG) index, and routine metabolic parameters taken during the first (8–12 weeks) and second (24–28 weeks) trimesters. The study found that elevated first-trimester ASIP and TyG index were independent risk factors for GDM, with their combination showing predictive value.

Tabular, Gestational Diabetes, Biomarkers, Pregnancy Health, Clinical Study, +1 more

Respond Service Health Assessments for Asylum Seekers in London, 2021-2023

A total of 1,497 health assessments were conducted for people seeking asylum in North-Central London from July 2021 to March 2023. The data, published by Paola Cinardo on figshare, includes clinical findings and interview data and shows high rates of physical and mental health needs: 83.2% of attendees had at least one identified health need.

Tabular, Excel, Mental Health, Asylum Seekers, Healthcare, Clinical Data, Health Assessment, Public Health, +1 more

Senior Management Expenses for The City of Calgary, 2026

Senior management expense reports from The City of Calgary, released twice per year in spring and fall. The data includes line item details for the City Manager, general managers, and directors. Budgets for these positions are reviewed and approved annually by City Council.

Tabular, CSV, XML, JSON, Financial Transparency, Government Expenses, Public Administration, Senior Management, +1 more

Great Barrier Reef Inter-Reefal Seabed Sediments and Geomorphology, Regional Synthesis

Geoscience Australia and the Australian Ocean Data Network provide a regional synthesis of inter-reefal seabed environments in the Great Barrier Reef Marine Park. The dataset integrates over 3,000 sediment samples from the MARS database with geomorphic feature data, offering the first such synthesis since the 1980s. It reveals regional trends and local-scale characteristics in sediment distribution, including gravel, sand, and mud concentrations across the shelf.

Audio, Geospatial, Great Barrier Reef, Spatial Analysis, Geomorphology, Marine Geology, +1 more

Experimental Data for Glass Fiber Reinforced Cement-Metakaolin Stabilized Silty Clay

115.6 MB of standardized experimental data generated by Benhui Pang for a study on interface-engineered glass fibers. The dataset includes raw and processed data for index properties, compaction behavior, tensile performance, mix optimization, strength tests, durability assessments, and microstructural analyses. It was last updated on 2026-05-21 to support transparency and reproducibility.

Image, Tabular, ZIP, Civil Engineering, Material Testing, Microstructural Analysis, Soil Stabilization, Synthetic, +1 more

Belt and Road Land-Use Synergy and Carbon Stocks, 2000–2022

8.3 MB of model input data, simulation outputs, and Python code supporting a study on land-use change and terrestrial carbon stocks across Belt and Road countries. The dataset covers a historical period from 2000 to 2022 and includes scenario simulations for future policy pathways. It was authored by Lulu Qu and last updated on 2026-05-21.

Tabular, Geospatial, ZIP, System Dynamics, Carbon Stocks, Land Use Change, Scenario Simulation, Benchmark, Belt And Road, Finance, +1 more

Visual Teach-and-Repeat Navigation Data for Low-Light and Night-Time Environments

A 23.5 GB dataset for visual teach-and-repeat (VTR) navigation designed to operate robustly in environments with variable or low light levels. The data, authored by Fuhai Ling and last updated in May 2026, supports a framework integrating deep-learned descriptors, stereo imaging, and event-based cameras. Experiments demonstrate the system's performance in night-time urban environments for both indoor and outdoor navigation.

Image, Multimodal, ZIP, Low Light Environments, Visual Teach And Repeat, Robotic Navigation, Drift Correction, Computer Vision, Event Based Vision, +1 more

Geoscience Australia Bathymetry Compilations Extent as of June 2019

Polygon extents represent bathymetry compilation products delivered by Geoscience Australia as of June 2019. The compilations were generated from numerous data sources including survey data, lidar, and interpolation. Each polygon's attributes contain information regarding data sources, product details, and access methods.

Geospatial, 🇦🇺 Australia, Oceanography, Marine Science, Synthetic, Bathymetry, +1 more

Litchi Mutagenesis Data: Pingyangmycin-Induced Somatic Mutations from Embryogenic Callus

307,629 high-quality somatic mutations were identified in litchi embryogenic callus treated with pingyangmycin. The dataset, authored by Guo Wang and last updated in May 2026, contains whole-genome resequencing results from treated callus and 40 regenerated mutant lines, reporting mutation frequencies of 1.8×10⁻⁴ and 1.4×10⁻⁴ per site.

Text, Plant Mutagenesis, In Vitro Culture, Somatic Mutation, Genomics, Litchi Breeding, +1 more

Litchi Mutagenesis Data: Pingyangmycin-Induced Somatic Mutations and Regenerated Plants

A research document details the establishment of a pingyangmycin-induced mutagenesis system for litchi using in vitro-cultured embryogenic callus. The study identified 307,629 high-quality somatic mutations in treated callus and over 1.2 million variants in regenerated mutant lines, with mutation frequencies exceeding typical EMS-induced rates. The document was authored by Guo Wang and last updated on 2026-05-08.

Text, Plant Mutagenesis, Somatic Mutation, Molecular Breeding, Litchi Genomics, Embryogenic Callus, +1 more

UK Biobank Frailty and Degenerative Bone/Joint Disease Risk Study

A research dataset from a prospective cohort study using UK Biobank data. It examines the relationship between frailty status, its longitudinal changes, and the incident risk of degenerative bone and joint diseases and their multimorbidity. The study was authored by Minghao Jin and the dataset was last updated in May 2026.

Tabular, Prospective Cohort, UK Biobank, Benchmark, Healthcare, Degenerative Bone Joint Diseases, Frailty Index, Multimorbidity, +1 more

TyG-WHtR Predicts Incident Type 2 Diabetes in NAFLD: A 12-Year Prospective Cohort Study

A 12-year prospective cohort study of 2,370 Japanese patients with nonalcoholic fatty liver disease (NAFLD) evaluates the predictive performance of twelve metabolic composite indices for incident type 2 diabetes mellitus. The dataset, authored by Nan’nan Chen and last updated in May 2026, likely contains patient-level data used to calculate indices like TyG-WC, TyG-WHtR, and VAI, and their association with diabetes onset via Cox models and ROC analysis.

Tabular, Nonalcoholic Fatty Liver Disease, Clinical Prediction, Prospective Cohort, Healthcare, Metabolic Indices, Type 2 Diabetes, +1 more

TyG-WHtR Predicts Diabetes in NAFLD Patients: A 12-Year Japanese Cohort Study

A 12-year prospective cohort study of 2,370 Japanese patients with nonalcoholic fatty liver disease (NAFLD) evaluates twelve metabolic composite indices for predicting incident type 2 diabetes. The triglyceride–glucose–waist–height ratio (TyG-WHtR) demonstrated the highest predictive accuracy with an AUC of 0.680 and an optimal cut-off of 4.54. Authored by Nan’nan Chen and shared under CC-BY-4.0, this research dataset was last updated on May 1, 2026.

Tabular, Japanese Population, Prospective Cohort, Healthcare, Clinical Research, Metabolic Indices, Type 2 Diabetes, +1 more

TyG-WHtR Predicts Incident Type 2 Diabetes in NAFLD: A 12-Year Prospective Cohort Study

A 12-year prospective cohort study of 2,370 Japanese patients with nonalcoholic fatty liver disease (NAFLD) evaluates the predictive ability of twelve metabolic composite indices for incident type 2 diabetes mellitus. The dataset, authored by Nan’nan Chen and last updated in 2026, likely contains patient-level clinical and outcome data used to calculate indices like TyG-WC, TyG-WHtR, and VAI. Results indicate the TyG-WHtR index had the highest predictive accuracy for diabetes onset in this population.

Tabular, Nonalcoholic Fatty Liver Disease, Prospective Cohort, Clinical Predictors, Healthcare, Metabolic Indices, Type 2 Diabetes, +1 more

Metabolic Indices Predicting Type 2 Diabetes in NAFLD Patients

A secondary analysis of 2,370 NAFLD patients from a prospective Japanese cohort study evaluates the predictive ability of twelve metabolic composite indices for incident type 2 diabetes over a 12-year follow-up. The dataset, authored by Nan’nan Chen and shared under a CC-BY-4.0 license, includes hazard ratios and AUC values for indices like TyG-WC, TyG-WHtR, and VAI. It was last updated on 2026-05-01.

Tabular, Nonalcoholic Fatty Liver Disease, Prospective Cohort, Healthcare, Clinical Prognosis, Metabolic Indices, Type 2 Diabetes, +1 more

TyG-WHtR Predicts Incident Type 2 Diabetes in NAFLD: A 12-Year Prospective Cohort Study

Nan’nan Chen's research dataset contains results from a 12-year prospective cohort study of 2,370 Japanese patients with nonalcoholic fatty liver disease (NAFLD). It compares the predictive ability of twelve metabolic composite indices for the onset of type 2 diabetes mellitus (T2DM). The dataset was last updated on 2026-05-01.

Tabular, Nonalcoholic Fatty Liver Disease, Prospective Cohort, Healthcare, Clinical Research, Metabolic Indices, Type 2 Diabetes, +1 more

TyG-WHtR Predicts Type 2 Diabetes in NAFLD: A 12-Year Prospective Japanese Cohort Study

A Japanese prospective cohort of 2,370 patients with nonalcoholic fatty liver disease (NAFLD) was used to evaluate twelve metabolic composite indices for predicting incident type 2 diabetes mellitus (T2DM). The study, authored by Nan’nan Chen, was last updated on May 1, 2026. It found the triglyceride–glucose–waist–height ratio (TyG-WHtR) had the highest predictive accuracy with an AUC of 0.680.

Tabular, Nonalcoholic Fatty Liver Disease, Prospective Cohort, Healthcare, Clinical Research, Metabolic Indices, Type 2 Diabetes, +1 more

Colombian General System Transfer Resources for Poor Populations, 2015-2021

Colombian municipal and district-level financial transfers from the General System for poor, uninsured populations from 2015 to 2021. The dataset includes columns for payer, payment orders, identification numbers, payment dates, concepts, funding sources, and transferred values. It is hosted by the Colombian government's open data portal, datos.gov.co, and was last updated in May 2026.

Tabular, CSV, XML, JSON, Social Welfare, Government transfers, Colombia, Public Finance, +1 more

Open-Weight LLM Evaluation for Screening RNA-seq Metadata

Supplementary materials for a pilot evaluation of 17 open-weight large language models screening RNA-seq metadata. The dataset includes performance metrics like AUPRC and F1 scores, runtime distributions, and reproducibility data across 150 projects per model. Mitsuo Shintani authored this CC-BY-4.0 licensed dataset, last updated in May 2026.

Tabular, ZIP, CSV, TSV, Machine Learning, RNA Seq, Benchmark, Bioinformatics, LLM Evaluation, Synthetic, +1 more

Page 76 of 2070