DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

NLP & Text Datasets | DataSalon

📝

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,808 datasets

NLP & Text

Replication Data for Chinese Police Training and FDI Abroad

Replication data for the paper 'Guarding economic interests abroad: FDI, political instability, and the proliferation of Chinese police training.' The package includes a dataset in Stata format, a Stata do-file for statistical analyses, and supplementary materials with figures and tables. The data was authored by Sae-Phoo, Lin and is hosted by Harvard Dataverse.

Tabular, Political Instability, China Foreign Policy, Finance, Foreign Direct Investment, Police Training, Replication Data, Synthetic, +1

NLP & Text

ArIA: Molecular Simulation Datasets and Code for AI Agent Development

A collection of code and data for reproducing results from the paper 'Molecular Simulations Assisted by an Artificial Intelligence Agent (ArIA)'. The dataset includes directories for model development, prompt generation, and application deployment. It was authored by Supphachok Chanmungkalakul and last updated on 2026-05-18.

Text, Multimodal, Artificial Intelligence, Fine Tuning, Orca, Computational Chemistry, Molecular Simulation, Synthetic, +1

NLP & Text

Coronet Hills Copper Mine Survey Notes with Proposed Drill Sites

Geoscience Australia Data provides a 2026 report detailing a plane table and theodolite survey of the abandoned Coronet Hills copper mine in the Northern Territory. The document describes sulphide-bearing lodes mineralized with copper, lead, and arsenic, and includes assay results from dumps and underground workings. It concludes with proposed locations for six diamond drill holes to test extensions of the lodes.

Text, Geospatial, Mineral Assay, Geology, Mining Exploration, Northern Territory, Finance, Geospatial Survey, +1

NLP & Text

Finite Element Simulation Outputs for Dual Twist Channel Angular Extrusion

128.9 MB of simulation data from a study comparing the Dual Twist Channel Angular Extrusion (DTCAE) process to Equal Channel Angular Pressing (ECAP). The dataset includes outputs from 3D Finite Element Method simulations run in DEFORM-3D, analyzing plastic deformation and strain distribution. It was authored by Vikash Ranjan and uploaded in April 2026.

Multimodal, CSV, Dual Twist Channel Angular Extrusion Dtcae, Finite Element Method Fem, Equal Channel Angular Pressing Ecap, Mechanical Engineering, Dtcae, Severe Plastic Deformation, Metal Forming, Ultra Fine Grained Materials, Finite Element Simulation, Severe Plastic Deformation Spd, +1

NLP & Text

Antimicrobial Resistant E. coli in Children and Soil from a South African Village

A 2026 case study from Lwamondo village, South Africa, investigates antimicrobial resistance in E. coli using a One Health approach. The research analyzes 47 paired stool and soil samples, yielding 117 and 94 E. coli isolates respectively, with phenotypic and genotypic resistance testing. Authored by Solanka Ellen Ledwaba, the dataset is a published PDF report.

Text, One Health, Microbiology, Healthcare, South Africa, Antimicrobial Resistance, Public Health, +1

NLP & Text

Colombian Energy Subsidy Usage Data for Non-Interconnected Zones

Colombian data tracks the use of energy tariff subsidies in non-interconnected zones (ZNI). The dataset includes company-level details on fuel purchases, quarterly spending, and subsidy amounts allocated to different socioeconomic strata. It is published by datos.gov.co and was last updated on 2026-05-18.

Tabular, CSV, XML, JSON, Colombia, Energy Subsidies, Public Finance, Utility Data, +1

NLP & Text

Julia Creek Sheet: Geological Map Explanatory Notes for Western Queensland

A 1961 geological mapping program by the Bureau of Mineral Resources' Great Artesian Basin Party produced this dataset. It covers the Julia Creek area, forming the western and northern margins of the Eromanga Sub-Basin in Western Queensland. The data describes Cretaceous rocks overlying a crystalline basement, with small outcrops of Precambrian granite and metamorphics in the southwest, and areas masked by Cainozoic and recent deposits.

Text, Geospatial, 🇦🇺 Australia, Geology, Geological mapping, +1

NLP & Text

Montreal Parks and Public Spaces Polygon Data

Over 1,495 parks and public spaces across Montreal's boroughs, covering more than 6,412 hectares. The dataset provides surface polygon representations for these areas within the urban fabric. Data is for representational purposes and is not a legal reference for park boundaries.

NLP & Text

DORIS Ground Station Coordinate Time Series from NASA

Ground-Based Doppler Orbitography by Radiopositioning Integrated on Satellite (DORIS) IDS Station Coordinates Product from NASA CDDIS provides station position time series in STCD format. The dataset is derived from DORIS data analysis by International DORIS Service (IDS) centers and is hosted by the National Aeronautics and Space Administration. One platform indicates a last update date of March 13, 2026.

Time Series, Geospatial, Satellite Positioning, Earth Science, Geodetics, Synthetic, +1

NLP & Text

IfGPT: Bulgarian Language Data for LLM Fine-tuning

The IfGPT Dataset is developed within the project IfGPT: Infrastructure for Fine-tuning Pre-trained Large Language Models. It aims to establish a freely accessible infrastructure for the selection and pre-processing of large datasets for Bulgarian as well as tailored data for particular industries. The dataset is authored by DCL-IBL and was last updated on Hugging Face in June 2026.

Text, Bulgarian Language, Language Model, Fine Tuning, Open Data, Text Processing, +1

NLP & Text

IfGPT: Bulgarian Language Corpus for Fine-Tuning LLMs

IfGPT is a dataset developed to establish a freely accessible infrastructure for fine-tuning large language models for Bulgarian. The project aims to provide tailored data for specific industries and purposes. It was created by DCL-IBL and was last updated on June 3, 2026.

Text, Ifgpt, Bulgarian Language, Language Model, Ai Training, Fine Tuning, Text Corpus, +1

NLP & Text

Rakhine and Myanmar Parallel Corpus for Machine Translation and NLP

A parallel dataset for Rakhine and Standard Burmese (Myanmar) language processing. The dataset was created by the author 'rakhine-nlp' and was last updated on the platform in June 2026. It is intended for machine translation, language modeling, and dialect analysis.

Text, Machine Translation, Myanmar Language, Rakhine Language, Language Preservation, Natural Language Processing, Parallel Corpus, +1

NLP & Text

Karachay Words Dataset for Turkic Language Model Training

A basic collection of Karachay words and phrases intended for training and fine-tuning language models for the Turkic language group, specifically the Karachay-Balkar language. The dataset is hosted on Hugging Face by author 'thetemirbolatov' and was last updated on 2026-05-27. Its size category suggests it likely contains between 10,000 and 100,000 entries.

Text, Karachay Language, Text Generation, Natural Language Processing, Vocabulary, Turkic Languages, Low Resource, +1

NLP & Text

Objective Activity and Disability Trajectories in Older Adults

Seven years of follow-up data from the National Health and Aging Trends Study (NHATS) on 480 community-dwelling older adults. The dataset, created by Jianhui Pan, links objective wrist-worn accelerometry metrics to long-term trajectories of functional disability.

Tabular, Healthcare, Aging Population, Large Scale, Stroke Rehabilitation, Physical Activity, Accelerometry, +1

NLP & Text

Accelerometer-Measured Activity and Disability Trajectories in Older Adults

480 community-dwelling older adults from the National Health and Aging Trends Study were monitored for 7 years using wrist-worn accelerometers to link objective activity patterns with functional disability trajectories. The dataset, created by Jianhui Pan and published in 2026, includes weighted data representing a population of 1.9 million.

Tabular, Healthcare, Stroke Recovery, Large Scale, Physical Activity, Aging Health, Accelerometry, +1

NLP & Text

Daily Oracle: A Continuous Benchmark for LLM Future Prediction Using News QA Pairs

20,085 true/false and 18,262 multiple-choice questions automatically generated from daily news headlines. The dataset, created by agentic-learning-ai-lab, spans January 1, 2020, to May 26, 2026, and is designed to evaluate how large language models' forecasting abilities evolve over time.

Text, News Qa, Temporal Benchmark, Benchmark, Llm Evaluation, Prediction Capability, Synthetic, +1

NLP & Text

Public Wifi User Connectivity Logs for Tunja Municipality, Boyacá

Datos.gov.co hosts public georeferenced data on users connected to various Wifi Zones in the Municipality of Tunja, Boyacá, reported by Primary Data Generating Units (UPGD). The dataset includes columns for ZONA, SECTOR, ID, FECHA Y HORA, ANIO, FECHA, LATITUD, HORA, and LONGITUD. It was last updated on 2026-05-18.

Tabular, Geospatial, CSV, XML, JSON, Wifi Usage, Public Wifi, User Connectivity, Municipal Data, +1

NLP & Text

Stranded Beach Dune Chronology in South-east South Australia, 0-250 ka

South-east South Australia's stranded coastal barriers preserve a record of sea-level variations over the past 800,000 years. This dataset presents new single-aliquot regenerative-dose optically stimulated luminescence (SAR-OSL) ages for quartz extracts from these dunes, extending the tested age range to 0-250 ka. The data, sourced from Geoscience Australia, compares these ages with an existing independent chronology to validate the SAR-OSL dating method.

Tabular, Sea level change, Geochronology, South Australia, Quartz Osl Dating, Coastal Geomorphology, +1

NLP & Text

Voyager 2 PLS M Mode Ion Spectra from Jupiter Flyby, 1979

Voyager 2 Plasma Spectrometer (PLS) data from the July 1979 Jupiter flyby. The dataset contains high-energy-resolution ion current spectra for protons across 128 logarithmic energy channels from 10 eV to 5950 eV, measured in femtoamperes. It was produced by NASA, with instrument details described in a 1977 Space Science Reviews reference.

Tabular, Space Physics, Jovian Magnetosphere, Ion Spectra, Voyager 2, Plasma, +1

NLP & Text

VG1 LECP: Voyager 1 Charged Particle Measurements Near Jupiter

Voyager 1's Low Energy Charged Particle experiment data collected in the vicinity of Jupiter. The dataset includes 48.0-second rate and flux measurements for electrons and ions across almost 100 instrument channels, with particles including protons, alpha particles, and light to heavy nuclei. NASA produced this globally calibrated dataset, last updated on the platform in April 2026.

Time Series, Space Physics, Charged Particles, Jupiter, Sensor Data, Voyager 1, +1
