DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,558 datasets

Argo Float Data: Ocean Temperature and Salinity Profiles from Australian Waters

More than 3,000 autonomous floats collect high-quality temperature and salinity measurements from the upper 2000 meters of the world's ice-free oceans. Each float completes approximately 150 cycles, surfacing every 10 days to transmit data via satellite. This dataset from Argo Australia and the Australian Ocean Data Network provides real-time observations of oceans surrounding Australia.

Time Series, Geospatial, Oceanography, Marine Science, Temperature Salinity, Argo Floats, +1

SciIR-82k: 80,000+ Scientific Image-Text Pairs for Reasoning Generation

SciIR-82k is a large-scale dataset for Scientific Image Reasoning Generation, containing more than 80,000 high-quality scientific image-text pairs. The samples are derived from open-access scientific publications and enriched with structured reasoning annotations. The dataset was created by author 'contton-sss' and was last updated on June 20, 2026.

Multimodal, Image Text Pairs, Scientific Image Reasoning, Scientific Publications, Benchmark, Computer Vision, AI Training, Large Scale, +1

Rice Cold Stress Phenotypic Data and Images from Seedling to Flowering Stages

A dataset from figshare by Fahamida Akter, last updated in April 2026, containing 12.5 MB of files related to cold stress in rice. It includes phenotypic performance data for 38 rice genotypes and supporting images documenting artificial cold screening at seedling and reproductive stages. The data covers traits like leaf discoloration scores, survival rates, and cluster analysis of cold-related traits.

Image, Tabular, Agricultural Research, Phenotypic Traits, Rice Cold Tolerance, Plant Physiology, Synthetic, +1

Biometric Gait Data from Mobile Phone Sensors for a Two-Day Training Set

A 3.9 GB repository accompanying a publication on a biometric gait system. It contains the files needed for minimal reproduction of experiments on the SIGNET data corpus and a notebook with results. The dataset was authored by Aleksander Sawicki and last updated on 2026-05-27.

Tabular, ZIP, Mobile Phone, Gait Analysis, Biometrics, Motion Sensors, Natural Language Processing, Reproducible Research, +1

Experimental Data on Dental Instrument Cleaning Efficacy and Structural Integrity

Experimental data from an in vitro study evaluating three moisturizing pretreatments on reusable dental instruments. The dataset includes cleanliness scores, ATP values, and SEM/EDS analysis results for 30 surgical burs and 30 Nickel-Titanium files per type. Xiuyu Tang published the data on figshare in April 2026.

Tabular, Excel, Instrument Integrity, Cleaning Efficacy, Bioburden, In Vitro Study, Dental Instruments, +1

Behavioral Biometrics in VR: Sensor Signal Modalities from BUT Corpus

A 1.9 GB repository enabling minimal reproduction of experiments from the BUT data corpus, related to the publication 'Behavioral Biometrics in VR: Changing Sensor Signal Modalities'. It was authored by Aleksander Sawicki and last updated on 2026-05-27.

Multimodal, ZIP, Sensor Signals, Virtual Reality, Behavioral Biometrics, Natural Language Processing, Biometric Data, +1

ICBF Information Assets Registry: Public Data Inventory

A public information asset registry from the Colombian Institute of Family Welfare (ICBF), created to comply with Law 1712 of 2014. The dataset likely contains metadata on information categories the entity generates, obtains, acquires, transforms, or controls. It was last updated on 2026-05-18 and is available via the www.datos.gov.co platform.

Tabular, CSV, XML, JSON, Colombia, Public Information, Metadata Registry, Government Inventory, +1

Imaginative Perception Token Pet IPT: Multimodal Spatial Reasoning Data

Released with the paper 'Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models' (arXiv:2606.03988). The dataset was authored by weikaih and last updated on Hugging Face on 2026-06-08.

Multimodal, Machine Learning, Spatial Reasoning, Multimodal Language Models, AI Training, Token Perception, Imaginative Perception, Perception Tokens, +1

Mouse Gut Microbiota Composition After Acute High-Intensity Exercise

A 15.2 KB dataset from figshare contains results from a study investigating dynamic alterations in gut microbiota following a 30-minute high-intensity treadmill run in BALB/c and C57BL/6 mice. Colonic content samples were collected at 0, 30, and 60 minutes post-exercise for 16S rRNA gene sequencing. The dataset, authored by Ruolin Gao and last updated in April 2026, shows strain-specific microbial changes and energy metabolism responses.

Tabular, Time Series, Excel, Mouse Model, 16s-rrna-sequencing, Healthcare, Gut Microbiota, Energy Metabolism, Exercise Physiology, +1

Mouse Gut Microbiota Composition and Energy Metabolism After Acute High-Intensity Exercise

A study investigating dynamic alterations in gut microbiota following a 30-minute high-intensity treadmill run in BALB/c and C57BL/6 mice. Colonic content samples were collected at 0, 30, and 60 minutes post-exercise for 16S rRNA gene sequencing. The dataset, authored by Ruolin Gao and last updated in April 2026, is shared under a CC-BY-4.0 license.

Tabular, Time Series, Mouse Model, 16s-rrna-sequencing, Healthcare, Gut Microbiota, Energy Metabolism, Exercise Physiology, +1

Mouse Gut Microbiota Composition After Acute High-Intensity Exercise

Ruolin Gao's study on figshare, last updated April 22, 2026, investigates dynamic changes in gut microbiota following acute high-intensity exercise in BALB/c and C57BL/6 mouse strains. The dataset, 20.8 KB in size, includes results from 16S rRNA gene sequencing of colonic content samples collected at 0, 30, and 60 minutes post-exercise. It captures strain-specific microbial diversity and functional responses related to energy metabolism and gut integrity.

Tabular, Time Series, Mouse Model, 16s-rrna-sequencing, Healthcare, Gut Microbiota, Energy Metabolism, Exercise Physiology, +1

Joint U.S.-Russian Arctic Sea Ice Atlas From Cold War Era

Observations from the Environmental Working Group Joint U.S.-Russian Arctic Sea Ice Atlas document Arctic sea ice conditions from 1950 to 1994. The atlas synthesizes data from satellites, ice stations, icebreakers, airborne surveys, and previously classified U.S. submarine missions from 1977-1993. It was developed through a collaborative U.S.-Russian partnership in the late 1990s and includes graphical ice charts, analysis methods, and climatological data.

Geospatial, Multimodal, Geospatial Atlas, Cold War Era, Arctic Sea Ice, Satellite Observations, Ice Climatology, +1

Microagent SFT V1: 6,820 Procedurally Generated SFT Examples for Small Reasoning Agents

Hudsongouge created a dataset of 6,820 procedurally generated supervised fine-tuning (SFT) examples, last updated on 2026-06-16. It is designed for training small reasoning agents with 1–3 billion parameters. The data aims to teach models to think before answering, use tools honestly, and refuse when evidence is missing.

Text, Anti Hallucination, Tool Use, Reasoning Agents, Synthetic Training Data, Instruction Tuning, Synthetic, +1

Cinematic Beauty: Qualitative Interview Data on Aesthetic Experience

A multimodal dataset from a qualitative interview study on the experience of beauty, using film clips as stimuli. The dataset includes video files of the stimuli, anonymized interview transcripts, visual bodily sensation maps, and analysis spreadsheets with thematic categories. It was created by Jakob Boer and last updated on June 8, 2026.

Text, Tabular, Video, Multimodal, Phenomenology, Multimodal Data, Beauty Experience, Qualitative Interviews, Film Stimuli, Synthetic, +1

Implicit and Explicit Voice Training Effects on Speech Perception and Listening Effort

Thirty-two normal-hearing participants completed speech-on-speech listening tasks after implicit or explicit voice training. The study, conducted by Ada Bicer, measured speech intelligibility and pupil dilation responses at three target-to-masker ratios. The dataset was harvested into DataverseNL and last updated on June 8, 2026.

Tabular, Audio, Speech Perception, Pupillometry, Listening Effort, Auditory Research, Voice Training, +1

Document Classification Scheme for Envigado's Personería, with Series and Subseries

A document classification scheme reflecting the hierarchy of records produced by an institution. It corresponds to validated document retention tables for the entity, with columns indicating sections, subsections, series, and subseries. The dataset is hosted by www.datos.gov.co and was last updated on 2026-05-18.

Tabular, CSV, XML, JSON, Records Management, Government Documents, Organizational Hierarchy, Document Classification, +1

EweBench: Benchmark for Evaluating LLMs in the Ewe Language

EweBench is the first standardized benchmark for evaluating Large Language Models on the Ewe language, a Kwa language spoken by approximately 7 million people in Togo and Ghana. It is hosted on Hugging Face by the author 'jojonocode' and was last updated on 2026-06-24. The dataset serves as a reference for assessing model performance on this specific language.

Text, Multilingual, Ewe Language, Benchmark, LLM Evaluation, Large Scale, Multilingual NLP, +1

DCLM Data 300M: GPT-2-Tokenized Sequences for Data-Constrained Language Model Training

Pre-tokenized .pt files containing packed GPT-2-tokenized sequences derived from the DCLM corpus. The dataset snapshots were curated by author zhiwei555 for the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'. They were last updated on June 8, 2026.

Text, Language Model Pretraining, Tokenized Data, NLP Research, Natural Language Processing, Text Corpus, +1

MathNet-Retrieve: 15,000 Olympiad Problems for Mathematical Retrieval Benchmarking

MathNet-Retrieve is a benchmark for math-aware information retrieval, created by ShadenA and last updated in June 2026. It contains 15,000 queries, each with a mathematically equivalent reformulation target provided at three difficulty tiers. The benchmark is designed to test retrieval systems on problems where the surface form is disguised while the underlying mathematical structure is preserved.

Text, Mathematics, Benchmark, Natural Language Processing, Olympiad Problems, Information Retrieval, NLP Benchmark, +1

Quantum Implementation of Non-Unitary Operations with Biorthogonal Representations

A dataset from the Plasma Science and Fusion Center Dataverse, authored by Efstratios Koukoutsis and colleagues, accompanying work that proposes a new dilation method for quantum implementation of non-unitary operations. The method maps non-unitary operators to isomorphic unitary matrices using biorthogonal representations. According to the description, it excels for operators with eigenvalues exceeding one in absolute value and is optimal for small-dimensional cases.

Tabular, Quantum Algorithms, Non Unitary Operations, Quantum Computing, Biorthogonal Representation, +1
