DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,799 datasets

Traditional Owner Interests in Australian Offshore Renewable Energy Development Areas

Australian Indigenous communities adjacent to Offshore Renewable Energy (ORE) wind farm development areas are the focus of this desktop study. The work compiled information on cultural values, Sea Country plans, Indigenous Cultural Intellectual Property, and preferred engagement methods. The raw spreadsheet is withheld due to cultural sensitivities, but a synthesis is available in the NESP MaC Project 3.3 final report.

Tabular, Cultural Values, Offshore Renewable Energy, Indigenous Interests, Desktop Study, Sea Country Plans, +1 more

Psychophysical Dissimilarity Ratings for Qualia Diversity Analysis

Raw participant ratings and individual dissimilarity matrices analyzed for a 2026 manuscript on measuring qualia diversity. The dataset includes CSV and H5 files totaling 1.1 MB. Kyoko Kusano and colleagues created this data to apply category-theoretic indices to psychophysical experimental results.

Tabular, CSV, HDF5, Psychophysics, Qualia Diversity, Category Theory, Dissimilarity Ratings, Synthetic, +1 more

Australia's Identified Mineral Resources 2009 with World Rankings and Resource Life

Geoscience Australia's 2009 data covers Economic Demonstrated Resources (EDR) for the country's mineral commodities. In 2008, EDR increased for 18 commodities, including black coal and iron ore, and decreased for nine others. The report provides world rankings, showing that Australia's resources of brown coal, nickel, and uranium are the world's largest, and analyzes resource life estimates for major commodities. It also discusses exploration expenditure trends for the 2008 calendar year.

Tabular, 🇦🇺 Australia, Mineral Resources, Economic Geology, Finance, Commodity Trends, +1 more

OCR2 Hardest10K: Aggregated Step Labels for AI Reasoning Traces

10,000 reasoning traces from the hardest OCR2 questions, aggregated from multiple AI models. The dataset was created by JingweiNi and last updated on May 30, 2026. Each row contains step-level labels from Qwen3.5-122B and GPT-5.5 models, formatted as aligned arrays.

Tabular, AI Evaluation, Reasoning Traces, Aggregated Labels, OCR, Step Verification, Synthetic, +1 more

Australia's Mineral Resources and Exploration Expenditure for 2009

Geoscience Australia's 2010 report provides estimates of the country's identified mineral resources as of December 2009 for major and minor commodities. These long-term resource estimates are compared with short-to-medium term industry ore reserves and include mine production data from the Australian Bureau of Agricultural and Resource Economics and Sciences. The report also analyzes mineral exploration expenditures for 2008-09 and 2009, presenting trends and Australia's world ranking based on United States Geological Survey information.

Tabular, 🇦🇺 Australia, Geology, Mineral Resources, Economic Analysis, Mining Industry, +1 more

VSTAT5S: 500 Spatiotemporal Reasoning Questions on 450 Synthetic 5-Second Videos

A 2026 dataset by ShushengYang contains 500 question-answer pairs for evaluating multimodal AI models. It is a short-video companion to VSTAT, featuring 450 synthetic video clips trimmed to approximately 5 seconds each. The dataset is packaged for use with the lmms-eval framework.

Tabular, Video, Video Reasoning, Multimodal QA, Spatiotemporal Reasoning, Synthetic Video, Synthetic, +1 more

Local Code Arena Starcoder 15B: MBPP Benchmark Telemetry

Raw evaluation metrics, execution telemetry logs, and structural syntax outputs from running the Mostly Basic Python Problems (MBPP) benchmark against the StarCoder 15B base model. This partition documents scaling limits of unaligned foundational weights in conversational benchmarking loops. The dataset was authored by ShahzebKhoso and last updated on 2026-05-28.

Tabular, Benchmark Evaluation, LLM Telemetry, Benchmark, Code Generation, Large Scale, Python Problems, +1 more

Winter Bioenergetics in Freshwater Fishes: Literature Review and Model Data

403 papers from a scoping literature review on the bioenergetics of freshwater fishes in cold (winter) conditions, compiled by Connor Reeve. The dataset includes two files: one containing data extracted from the reviewed papers and another detailing models from the Fish Bioenergetics 4.0 software. It was last updated on April 28, 2026.

Tabular, Text, Excel, Bioenergetics, Model Evaluation, Literature Review, Benchmark, Freshwater fish, +1 more

Kimi-K2.6-Technical-Reasoning-AddOn-3300x: AI-Generated Technical Reasoning Traces

A dataset of 3300 technical reasoning traces generated by the Kimi K2.6 teacher model. It was designed as an add-on for downstream supervised fine-tuning experiments, focusing on math, graduate-level science, coding, and debugging prompts. The dataset was authored by trjxter and last updated on June 3, 2026.

Text, Technical Reasoning, SFT, Coding, Science, Math, Synthetic, +1 more

Snatch Lift Biomechanics from Inertial Motion Capture and EMG Data

Shu Zhang's dataset on figshare contains biomechanical data from 23 youth weightlifters aged 15–18 performing snatch lifts at 70%, 80%, and 90% of their 1RM. Data includes inertial motion capture and EMG recordings, with deep muscle forces and joint loads calculated using OpenSim. The dataset was last updated on April 14, 2026.

Multimodal, ZIP, Weightlifting, Benchmark, Biomechanics, Motion Capture, Joint Load, Muscle Force, +1 more

Australian Offshore Mineral Locations Map from 2006

Australia's offshore mineral occurrences and deposits within its 200-nautical-mile exclusive economic zone and extended continental shelf. The map draws together data from published and unpublished marine research surveys and government records, showing resources like manganese nodules, heavy mineral sand, and diamonds. It was produced collaboratively by Geoscience Australia, CSIRO, and state and territory geological surveys.

Geospatial, 🇦🇺 Australia, Mineral Locations, Finance, Large Scale, Marine Geology, Offshore resources, +1 more

Providence Police Case Log with Offense Details for the Past 180 Days

Recorded state and municipal offenses from the AEGIS records management system of the Providence Police. The data is published by data.providenceri.gov and was last updated on April 3, 2026. A single case can contain multiple offenses, and the log excludes certain sensitive cases to protect victims and juveniles.

Tabular, CSV, XML, JSON, Safety, Police, Police Case Log, Law Enforcement, Economy, Crime Incidents, Public Safety, Crime, +1 more

Local Code Arena Starcoder2 3B: MBPP Benchmark Telemetry

StarCoder2 3B base model evaluation on the Mostly Basic Python Problems (MBPP) benchmark. The dataset contains raw evaluation metrics, execution telemetry logs, and structural syntax outputs captured from automated conversational pipelines. It was authored by ShahzebKhoso and last updated on May 28, 2026.

Tabular, Telemetry Logs, Python Benchmark, Benchmark, LLM Evaluation, Code Generation, +1 more

Loss Function Visualization Data for a Line Chart

Raw data for a line chart visualizing a loss function, as referenced in Figure 7 of a published research article. The dataset was authored by Ruishi Liang and published on figshare in May 2026. It is a small file of 28.1 KB.

Tabular, Excel, Machine Learning, Educational, Visualization, Loss Function, +1 more

Eucheumatopsis Seaweed Molecular, Chemical, and Morphological Data from Yucatán, Mexico

Molecular, chemical, and morphological data for the seaweed species Eucheumatopsis isiformis, collected from March to November 2022 in Yucatán, Mexico, with a comparison specimen from Florida, USA. The dataset includes gene sequencing for haplotype construction, carrageenan yield and sulfate content measurements, and morphological characterizations. It was authored by Monserrat López-Yllescas and is available under a CC-BY-4.0 license.

Tabular, ZIP, Seaweed Morphology, Carrageenan Content, Molecular Analysis, Marine Biology, +1 more

Correspondence of CTCAE-Xemio Side Effects and QoL Questionnaire Items

A mapping table linking Common Terminology Criteria for Adverse Events (CTCAE) codes for side effects to corresponding items in Quality of Life (QoL) questionnaires, specifically the EORTC QLQ-C30. The dataset was authored by Maria-Angeles Fuentes-Expósito and last updated on May 13, 2026. All questionnaire results referenced are from a single time point, T=12.

Tabular, Excel, Questionnaire Mapping, Side Effects, Quality of Life, Clinical Trials, +1 more

ADQA-Bench: Audio-Dependent Question Answering Evaluation Set for DCASE 2026

ADQA-Bench is the official evaluation set for the DCASE 2026 Challenge Task 5: Audio-Dependent Question Answering. It focuses on addressing textual hallucination in Large Audio-Language Models by requiring models to answer questions based on audio perception rather than linguistic priors. The dataset was authored by Harland and last updated on May 29, 2026.

Audio, Multimodal, DCASE Challenge, Evaluation Benchmark, Benchmark, Question Answering, Audio Language Models, +1 more

Nest Predation Rates in a U.K. Deciduous Forest Fragment

Noah Atkin from Imperial College London conducted a study on nest predation in a mixed deciduous forest fragment bordering open grassland. Artificial nests containing quail and plasticine eggs were placed at ground and arboreal levels to test the edge effect hypothesis. The dataset likely contains records of predation events and nest locations.

Tabular, Ecology, Conservation, Forest Fragmentation, Nest Predation, Edge Effect, Synthetic, +1 more

Nafie SFT v1: 119,877 Turkish Instruction-Tuning Examples

119,877 prompt-response examples for supervised fine-tuning of Turkish language models. The dataset was created by nafie-ai and focuses on rule-based reasoning, text-grounded question answering, and safe handling of toxic inputs. It was last updated on June 2, 2026.

Text, Safety, Reasoning, Turkish NLP, Instruction Tuning, Supervised Fine Tuning, +1 more

Tamil Non-STEM Textbook Corpus with 20.48 Million Words

486 Tamil textbooks containing 20.48 million words, designed to support NLP development. The dataset is part of a larger multilingual educational corpus with over 2.6 billion words across 5,000+ subjects, created by InfoBayAI and last updated in June 2026.

Text, Multilingual, Textbooks, Multilingual Corpus, Education, Non-STEM, Large Scale, Natural Language Processing, Tamil Language, +1 more
