NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

43,991 datasets

Evaluation of CIDA's Regional Inter-American Program: 2004-2005 to 2009-2010 Reports

Evaluation reports from Global Affairs Canada serve as a practical management tool for reviewing program performance. The collection includes reports for the Canadian International Development Agency's Regional Inter-American Program spanning the 2004-2005 to 2009-2010 fiscal years. Each report aims to improve the design and implementation of upcoming international development initiatives.

Tags: Text, 🇨🇦 Canada, Benchmark, International Development, Program Evaluation, Government Reports, Synthetic, +1 more

AISHELL8-RealScene: Mandarin Conversational Speech with Multi-View Video

AISHELL8-RealScene is a multimodal dataset of conversational Mandarin speech recorded in real-world settings. It contains 102.19 hours of audio from 171 foreground speakers across 5 different locations. The dataset was created by SMIIP-lab and includes synchronized near-field and 8-channel far-field audio with multi-view facial video.

Tags: Audio, Video, Mandarin Chinese, Multimodal Audio Video, Conversational Speech, Far Field Audio, Speech Recognition, +1 more

Trp521 Oxidation Effects on FtsH2 Stability and PSII Repair

Raw data supporting figures and tables for the paper 'Trp521 oxidation affects FtsH2 stability and its role in PSII repair.' The 35.5 MB XLSX file was authored by ZHANG JINGZHI and last updated on May 30, 2026. It is shared under a CC-BY-4.0 license on figshare.

Tags: Tabular, Excel, Protein Stability, Photosynthesis, Molecular Biology, Plant Science, Oxidation, +1 more

Emotional Responses to Virtual Public Speaking Tasks with Audience Manipulation

This dataset comes from a study by Evania Fasya involving 102 participants with varying levels of public speaking anxiety. Participants delivered speeches to virtual audiences of varying sizes and attitudes, and the study measured physiological signals, speech characteristics, subjective stress, and audience evaluations. The dataset was last updated on June 15, 2026.

Tags: Tabular, Audio, Physiological Signals, Virtual Reality, Psychology, Benchmark, Experimental Data, Public Speaking Anxiety, +1 more

GPT-5.5 Agent: Traces from AI Agent Interactions

GPT-5.5 Agent contains raw agent trace files generated by the teich platform from TeichAI. The dataset includes 89 JSONL files with metadata indicating the underlying model was GPT-5.5. It was uploaded by AletheiaResearch and last updated on June 23, 2026.

Tags: Text, Agent Traces, LLM Training, Tool Schemas, GPT-5.5, Synthetic, +1 more

Generative AI Privacy Survey of South Korean Users, 400 Valid Responses

A survey conducted December 3–13, 2025, of generative AI users in South Korea aged 14–69 who used generative AI at least once per week. The data comprises 400 valid survey responses collected via a stratified panel sampling procedure by Macromill Embrain, with IRB approval and participant consent. It was authored by Hongjin Shim and published on Harvard Dataverse.

Tags: Tabular, South Korea, Generative AI, Survey Data, Privacy, User Behavior, +1 more

Table 1_Serial passaging in vitro generates a Vero cell-adapted coxsackievirus A6 strain w

Sijin Xia uploaded a research dataset on 2026-04-28 detailing the in vitro adaptation of Coxsackievirus A6 (CVA6) in Vero cells. The dataset likely contains phenotypic and transcriptomic comparisons between two recombinant virus strains, rV10 and rV45, generated from passage 10 and passage 45. The 2.0 MB file is a DOCX document describing viral growth, cytopathic effects, receptor interaction, pathogenicity, and host transcriptomic responses.

Tags: Text, Virus Adaptation, Virology, Healthcare, Vero Cells, Vaccine Research, Synthetic, Hand Foot Mouth Disease, +1 more

Cadmium Phytoremediation Enhancement in Sweet Sorghum with Chemical Amendments

Experimental data from figshare evaluates chemical amendments for enhancing cadmium accumulation in sweet sorghum. The dataset, authored by Juan Li and last updated in April 2026, likely contains measurements from trials applying ferric chloride, citric acid, and polyaspartic acid. It supports the development of a practical strategy for phytoremediation combined with bioethanol production.

Tags: Tabular, Audio, Excel, Cadmium, Sweet Sorghum, Agricultural Research, Soil Amendment, Phytoremediation, +1 more

SEIGE Attack Evals: Adversarial Prompts and Model Responses

495 attack-level evaluation rows generated by the SEIGE framework from a local Ollama model sweep. The dataset includes attack prompts, model responses, pass/fail outcomes, risk scores, and metadata for security analysis. It was created by user 'tmesttttttttt' and last updated on June 5, 2026.

Tags: Tabular, Prompt Attacks, Benchmark, Model Safety, LLM Security, Adversarial Evaluation, Synthetic, +1 more

Pre-1930 Public Domain Educational Texts for Instruction Tuning

27 public-domain educational texts published before 1930 form this supervised fine-tuning dataset. The texts, sourced from the Internet Archive, span natural science, history, law, philosophy, and grammar, and are written in a question-and-answer catechism format. The dataset was created by zachnorton03 and last updated on June 19, 2026.

Tags: Text, Question-Answer, Pre-1930, Instruction Tuning, Educational Texts, Public Domain, +1 more

Claude Distills: Unified Datasets for Language Model Distillation

A curated collection of open-source datasets for distilling knowledge from Anthropic's Claude models. The repository contains at least two unified subsets, including 'claude-sonnet-4.6-120000x' with 119,446 samples and 'claude-opus-4.6-10000x' with 9,633 samples. The data was aggregated and formatted by ansulev, with credit to original creators, and was last updated on 2026-06-16.

Tags: Text, Claude Distillation, Language Model, Instruction Tuning, Synthetic Data, +1 more

Shell-Code-Large: 640,000 Shell Scripting Code Samples for LLM Pretraining

Approximately 640,000 Shell scripting code samples compiled by ajibawa-2023, last updated on 2026-06-20. This large-scale corpus is stored in JSON Lines format and is designed to support research in code intelligence and automation. The dataset's primary purpose is to facilitate large language model pretraining and software engineering tasks.

Tags: Text, Source Code, Shell Scripting, DevOps Automation, LLM Pretraining, Large Scale, Natural Language Processing, +1 more

Speech Translation And Summarization: English-Centric Multilingual Audio Dataset

McGill-NLP provides generated article and summary audio for English-centric multilingual directions. The dataset includes audio files and metadata for language pairs such as Amharic-English, Arabic-English, Bengali-English, and Chinese-English. It was last updated on June 17, 2026.

Tags: Audio, Multimodal, Multilingual, Summarization, Natural Language Processing, Speech Translation, Multilingual Audio, Synthetic, +1 more

Peritraumatic CRP Levels and Pain Outcomes Following Traumatic Stress Exposure

AURORA study data from 385 men and women tracks C-reactive protein levels and pain severity after traumatic stress exposure. Lauren A. McKibben's research, last updated in 2026, reveals a sex-dependent relationship where peritraumatic CRP predicts chronic pain outcomes in men but not women. The dataset likely contains longitudinal biomarker measurements and pain questionnaire responses collected from emergency department patients.

Tags: Tabular, Inflammatory Markers, Pain Biomarkers, Longitudinal Study, Clinical Research, Sex Differences, +1 more

Exploring Art, Knowledge and Movement in Japanese Fashion

A PDF document explores the Japanese concept of 'ma' and its influence on fashion design. The work by Vivien Jiaqian Zhu connects Greek philosophy, avant-garde Japanese designers, the Paris fashion scene, and Chinese classics like The Dream of the Red Chamber. It was last updated on 2026-05-16 and is licensed under CC-BY-4.0.

Tags: Text, Performance Studies, Art History, Philosophy, Japanese Fashion, Cultural Studies, +1 more

FreeStyle Dataset: CRef/SRef LoRA Triplets for Diffusion Training

A collection of CRef/SRef LoRA triplets exported from the 0426 diffusion training data. Each training example contains three images: a content reference, a style reference, and a target image generated from the combined condition. The dataset was created by Blue2Giant and last updated on June 17, 2026.

Tags: Image, Tabular, LoRA Training, Computer Vision, Image Generation, Diffusion Models, Synthetic, +1 more

MERP: Music Emotion Recognition with Profile Information

54 full-length songs were dynamically rated for valence and arousal by listeners recruited via Amazon Mechanical Turk. The MERP dataset was created by amaai-lab for music emotion recognition research. The dataset page was last updated on 2026-06-20.

Tags: Tabular, Audio, Multimodal, Machine Learning, Amaai Lab, Valence Arousal, AI Research, Human Ratings, Music Emotion Recognition, +1 more

Geospatial Data Collection with 3D Models and Disaster Information

A curated collection of publicly available geospatial datasets from the Geospatial Information Center that specify a Creative Commons license. The data includes 3D city and building models, elevation and terrain data, vector datasets for administrative boundaries and land use, and disaster-related geospatial information. The collection is hosted on AWS S3 and aggregated by the organization AIGID.

Tags: Geospatial, Administrative Boundaries, 3D City Models, Elevation Data, Disaster Management, +1 more

INDICADORES RESOLUCIÓN 1522 DE 2013: Colombian Healthcare Appointment Wait Time Metrics

INDICADORES RESOLUCIÓN 1522 DE 2013 is a quarterly report of indicators from Resolution 1552, measuring appointment opportunity for general medicine, dentistry, and specialized medicine in Colombia. The dataset is hosted on the Colombian open data portal www.datos.gov.co and was last updated on 2026-05-27. It contains columns such as MAX DIAS ESPERA, MIN DIAS ESPERA, HORAS PROMEDIO, and PERIODO.

Tags: Tabular, Time Series, CSV, XML, JSON, Colombia, Performance Indicators, Wait Times, +1 more

Women's Attitudes and Lived Experiences with Oral PrEP Use

Qualitative data from focus groups and key informants examines women's attitudes towards oral pre-exposure prophylaxis for HIV prevention. The 650.8 KB dataset, authored by Grace Kenyonga and last updated in May 2026, contains manually analyzed thematic findings. Women's views on PrEP include empowerment for health control and concerns about promiscuity and partner conflict.

Tags: Text, HIV Prevention, PrEP Attitudes, Healthcare, Women's Health, Public Health, Qualitative Research, +1 more
