DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,732 datasets

Nemotron-SFT-ARC-AGI-v1: Multi-Turn Agentic Reasoning Traces for Visual Puzzles

Nemotron-SFT-ARC-AGI-v1 is a supervised fine-tuning dataset of multi-turn agentic reasoning traces. It was created by NVIDIA using nine open-weight large language models attempting to solve ARC-AGI visual-reasoning puzzles. The dataset was last updated on June 4, 2026.

Text, Agentic Reasoning, ARC-AGI, LLM Training, Supervised Fine-Tuning, Visual Reasoning, +1 more

Local Code Arena MBPP: Qwen3 1.7B Benchmark Telemetry

Local Code Arena Telemetry captures raw evaluation metrics and execution logs from running the Mostly Basic Python Problems (MBPP) benchmark against the 1.7B-parameter Qwen3 model. The dataset was created by ShahzebKhoso and last updated on 2026-05-29. It provides a direct point of comparison for evaluating next-generation AI models on consumer hardware.

Tabular, LLM Telemetry, Benchmark, Benchmark Results, Python Problems, +1 more

AGSO Cruise 186: Antarctic Marine Geoscience Core and Seismic Data, 1996/97

Eight sediment cores from Vincennes Bay and 19 from Prydz Bay were collected during the 1996/97 Antarctic season to study ice sheet retreat. About 200 km of seismic data from Vincennes Bay and 900 km from Prydz Bay reveal glacial erosion patterns and moraine structures. This post-cruise report summarizes preliminary results from the AGSO/ANARE marine geoscience program in East Antarctica.

Tabular, Geospatial, Seismic Survey, Antarctic Geoscience, Paleoclimate, Finance, Glacial Geology, Marine Sediment Cores, +1 more

NIV: Over One Million Variation Tuples for Variable Font Generation

Over one million variation tuples derived from variable Google Fonts, used for training the NIV (Neural Axis Variations) model. The dataset comprises per-point displacements for font outlines. It was created by ndvb and was last updated on the platform in June 2026.

Tabular, Typography, Computer Vision, Large Scale, Variable Fonts, Font Generation, +1 more

SEENEZ GH Trial: Stakeholder Interview Data on Growth Hormone Treatment Preferences

A qualitative dataset from the SEENEZ GH trial, containing interview data from 26 participants. The data was collected by researchers to analyze preferences for continuing or discontinuing growth hormone treatment in adolescents with transient idiopathic isolated growth hormone deficiency. The dataset was uploaded by figshare admin karger and last updated on April 22, 2026.

Text, Excel, Growth Hormone, Healthcare, Clinical Trials, Pediatric Endocrinology, Qualitative Research, +1 more

Benthic Sediment Surveys of Darwin and Bynoe Harbours (2017)

Inner Darwin Harbour and shallow water areas in and around Bynoe Harbour were surveyed from 29 May to 16 August 2017. The project collected 285 seabed sediment samples for grain size, inorganic elemental, and organic matter analyses, alongside seagrass and hardground observations. This work was part of a four-year (2014-2018) science program led by the Northern Territory Government and funded by the INPEX-led Ichthys LNG Project, in collaboration with Geoscience Australia and the Australian Institute of Marine Science.

Tabular, Geospatial, Grain Size Analysis, Marine Habitat, Benchmark, Benthic Sediment, Environmental Baseline, +1 more

MedSP1000: Standardized Patient Cases for Evaluating Clinical AI Agents

MedSP1000 is an interactive benchmark derived from standardized patient cases for evaluating large language models as clinical agents. The dataset, created by byrLLCC and described in a 2026 paper, focuses on dynamic, multi-turn clinical encounters rather than static medical question-answering.

Text, Standardized Patient, Benchmark, LLM Evaluation, Healthcare, Clinical AI, Medical Benchmark, +1 more

Nemotron-SFT-Math-v4: Model-Generated Solutions for Mathematical Problems

Nemotron-SFT-Math-v4 is a large-scale mathematical reasoning dataset containing model-generated reasoning trajectories. Solutions were generated using DeepSeek-V4-Pro on High inference mode. The underlying problems are sourced from the nvidia/Nemotron-Math-v2 dataset, which contains high-quality mathematical problems derived from the Art of Problem Solving (AoPS) community and Math StackExchange/MathOverflow.

Text, Mathematical Reasoning, Model Generated, Problem Solving, NLP Training, Large Scale, Synthetic, +1 more

Dataset for “Materiality, Symbolism, Technology, and Politics: A ‘Four‑Lens Prism’ Model f

Six CSV files support the analysis of ethnic pictorial manuscripts from Yunnan-Guizhou. The data includes coding of agricultural tool morphology across five manuscript versions, symbol-ethnic group co-occurrence frequencies, and a policy-artifact time series from 1730 to 1790. Author Xin Wu published this dataset on figshare in 2026 under a CC-BY-4.0 license.

Tabular, Time Series, ZIP, Artifact Analysis, Ethnic Studies, Historical Manuscripts, Yunnan Guizhou, Cultural Heritage, +1 more

Content Analysis Database for News Coverage of Caitlyn Jenner

A 38.7 KB Excel database supporting a 2016 master's thesis and a 2019 book chapter on news framing. It was created by Edrei Álvarez-Monsiváis and last updated on 2026-05-15. The data likely contains coded content from news articles about celebrity Caitlyn Jenner.

Tabular, Excel, Content Analysis, Transgender Coverage, News Media, Celebrity Studies, +1 more

Meta Muse Spark Distilled 5K: Synthetic Reasoning Traces

This dataset contains 5,000 unique synthetic examples, created in May 2026, that are designed to teach step-by-step reasoning. It was programmatically generated by gss1147 to mirror the thinking style of Meta's Muse Spark frontier model. Each reasoning trace is structured around four steps: Understand, Plan, Execute, and Verify.

Text, AI Training, Natural Language Processing, Instruction Following, Synthetic, Synthetic Reasoning, +1 more

dataBERT: AI-Assisted Framework for Sustainable Investment Concept Discovery

A 59.6 KB dataset supporting an AI-assisted framework for inductive theory building in sustainable investment research. The dataset, created by Gunawan Wibisono and last updated in April 2026, was derived from a Scopus-screened corpus of academic literature. It models an integrative conceptual architecture organized around cognitive, structural, and bridge mechanisms.

Tabular, Excel, Sustainable Investment, Text Analysis, ESG, BERTopic, Natural Language Processing, Conceptual Fragmentation, +1 more

Plastic Injection Process Simulation Results and Error Metrics

5.5 KB of tabular data presents numerical simulation results for a convection-diffusion model used in plastic manufacturing. The dataset, created by Ahmed M. Abed, contains error metrics and outcomes from a mathematical poka-yoke simulator designed to reduce defects. It was last updated in April 2026.

Tabular, Excel, Simulation Results, Thermal Convection, Process Optimization, Plastic Manufacturing, +1 more

Plastic Manufacturing Convection-Diffusion Simulation Results

Ahmed M. Abed created a 5.5 KB Excel dataset containing numerical simulation results for a convection-diffusion model in plastic manufacturing. The data includes tabular and graphical outcomes from a mathematical poka-yoke simulator, used to analyze defect causes. The dataset was last updated in April 2026.

Tabular, Excel, Convection Diffusion Modeling, Plastic Injection, Numerical Analysis, Poka Yoke, Manufacturing Process Simulation, +1 more

Mat-PYS Control System Pseudocode for Plastic Manufacturing

A 9.5 KB Excel file contains pseudocode for the Mat-Poka-Yoke System (Mat-PYS), a control mechanism for plastic injection molding. The system was developed by Ahmed M. Abed and last updated in April 2026. It mathematically models convection-diffusion to reduce defects and improve machine efficiency.

Tabular, Excel, Simulation Data, Convection Diffusion, Poka Yoke, Manufacturing Control, +1 more

Vaani Noise Event Timestamps: Multilingual Speech from India

India's linguistic diversity across all districts is captured in this derived dataset from Project Vaani, a large-scale multilingual speech initiative by IISc Bangalore and ARTPARK. The dataset contains noise event timestamps and is actively being built, with a current subset of a planned corpus of approximately 167 hours of training data. The dataset page was last updated on 2026-06-05.

Audio, 🇮🇳 India, Multilingual, Large Scale, Natural Language Processing, Noise Events, Audio Processing, +1 more

Keleti1 Rye: Morphological and Reproductive Traits of Diploid and Tetraploid Plants

Individual plant-level measurements of growth and yield-related traits for diploid (2x; 'Keleti1') and tetraploid (4x; 'Keleti1T') perennial rye genotypes. The dataset is 14.2 KB in size and was authored by Ahmed Ali Hamad, last updated on May 13, 2026. Missing data are indicated as 'NA' and values represent direct measurements or derived means per plant.

Tabular, Excel, Ploidy, Plant Biology, Reproduction, Morphology, Rye Traits, +1 more

Pleurotus ostreatus Mating Type Primer Sets and Genomic Analysis

Primer sets and genomic data for characterizing the multiallelic mating-type loci in the edible oyster mushroom Pleurotus ostreatus. Yi-Yun Lee developed this resource, which includes analysis of 12 haplotypes identifying 11 A and 12 B alleles. The dataset was last updated in April 2026.

Tabular, Excel, Fungal Genetics, Mating Type Loci, Genomic Assembly, Pleurotus ostreatus, +1 more

BOREAS HYD-03: Subcanopy Solar Radiation on Snow

Several pyranometers collected solar radiation data for 3-4 consecutive days in jack pine (1994) and black spruce and aspen forests (1996). The BOREAS HYD-03 team used this array to test the hypothesis that energy transfer and snow water equivalent vary spatially with canopy closure. Data quality is noted as good due to generally clear days and daily maintenance of the radiometers.

Tabular, Time Series, ZIP, Text, Boreal Ecosystem, Solar Radiation, Snow Water Equivalent, Energy Transfer, Forest Hydrology, +1 more

BOREAS Landsat TM Level-3s: Scaled At-Sensor Radiance Imagery

Landsat TM data from 22-Jun-1984 to 30-Jul-1996 provides spatially extensive information for the BOREAS study areas. The imagery includes radiant energy, detailed land cover, and biophysical parameter maps such as FPAR and LAI. It primarily covers the Northern and Southern Study Areas (NSA/SSA) of the Boreal Ecosystem-Atmosphere Study.

Image, Geospatial, ZIP, Text, Landsat, Boreal Ecosystem, Computer Vision, Land Cover, Biophysical Parameters, +1 more
