DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,649 datasets

Eromanga Basin Hydrogeological Inventory for the Great Artesian Basin

The Eromanga Basin dataset from the Australian Ocean Data Network contains descriptive attribute information for spatial groundwater features in the Great Artesian Basin. It covers over 1,250,000 square kilometres in central and eastern Australia, with data grouped into themes like location, geology, hydrogeology, and land use. The dataset was last updated on 2026-04-16.

Geospatial, 🇦🇺 Australia, Geology, Groundwater, Hydrogeology, +1

U.S. Boxing Injury Emergency Department Visits, 2000-2023

This dataset covers an estimated 362,869 boxing injuries treated in U.S. emergency departments from 2000 to 2023, drawn from the National Electronic Injury Surveillance System. The study by Jenna Tsuzaki, published on figshare in 2026, reports injury rates, demographics, diagnoses, and mechanisms. It finds a 46.6% increase in the injury rate over the period, with fractures the most common diagnosis.

Tabular, Emergency Medicine, Epidemiology, Large Scale, Boxing, Sports Injuries, Public Health, +1

Vulcan Transect: Crustal Seismic and Gravity Data from Australia to the Timor Trough

Wide-angle seismic data from ocean bottom seismographs, together with gravity and deep marine reflection profiling data, define crustal-scale features along the Vulcan transect in northern Australia. The dataset, provided by Geoscience Australia, outlines the crustal and upper mantle architecture across the boundary between the Australian and SE Asian plates. It includes interpretations of crustal thickness, basin sequences, and evidence of intrusive rocks at depth.

Geospatial, Geophysics, Australia Geology, Seismic Data, Timor Trough, Crustal Structure, +1

Albany Submarine Canyons: Geomorphology and Origin Off Southwest Australia

The Albany Canyon complex extends 700 km from Cape Leeuwin to east of Esperance, with canyons cutting down 1500-2000 m in places. Data from Geoscience Australia includes information from seismic profiles and describes canyon morphology, orientation, and exposed Jurassic and younger sequences. This dataset was last updated on 2026-04-30.

Geospatial, 🇦🇺 Australia, Submarine Canyons, Geology, Marine Geomorphology, Seismic Data, +1

Shoreline Response to Clustered Storm Events in Southeast Australia

A framework for modelling beach erosion from clustered storms, focusing on two case study areas in southeast Australia: the Adelaide metropolitan coast and Old Bar beach. The dataset integrates coastal geomorphology and engineering approaches, using sub-surface information like boreholes and ground-penetrating radar to estimate sediment volumes. This work is a contribution to the Bushfire and Natural Hazard Cooperative Research Centre project on storm surge resilience.

Time Series, Geospatial, Sediment Modelling, Storm Erosion, Natural Hazards, Australia Coast, Coastal Geomorphology, +1

Elemental Ratios in Cultured Globigerina Bulloides from the Norwegian Sea

Foraminifera of the planktic species Globigerina bulloides were collected from the Norwegian Sea in summer 2022 and grown in the laboratory. The dataset provides element/Ca ratios (Mg/Ca, Na/Ca, Sr/Ca) for the tests of these cultured specimens and for their culturing substrate, measured by Laser Ablation ICP-MS at the University of Southampton in summer 2023. The data correspond to articles by Sykes et al. (2024 and in submission).

Tabular, Foraminifera, Cultured Specimens, Elemental Ratios, Palaeoceanography, Biomineralisation, +1

Cogniti Prompt: AI-Generated Data Points

Two data points were generated by Cogniti AI. The dataset is a 441.9 KB file in PNG format, authored by Qiaoying Liang and last updated on May 28, 2026. It is shared under a CC-BY-4.0 license on figshare.

Image, Cogniti Ai, Prompt Engineering, Ai Generated, Synthetic, +1

FAVA: Fermi All-Sky Variability Analysis Catalog of Flaring Gamma-Ray Sources

A catalog of flaring gamma-ray sources detected by the Fermi Large Area Telescope over 7.4 years, from August 2008 to January 2016. The Fermi All-sky Variability Analysis (FAVA) technique was used to search for flares in weekly time bins across two energy bands. This catalog was produced by NASA and ingested by the HEASARC in July 2017.

Tabular, Time Series, Astronomy, Gamma Ray, Astrophysics, Space Telescope, +1

MMMU Distribution Simulation: Synthetic Multimodal Multiple-Choice Questions

A synthetic dataset of 234 question-answer pairs designed to mirror the distribution of the MMMU-Pro benchmark. It contains 78 unique questions across 30 academic subjects, each presented in three different visual and textual formats. The dataset was created by YiYang109 and last updated on Hugging Face in May 2026.

Multimodal, Image Text, Benchmark, Computer Vision, Academic Subjects, Multiple Choice Qa, Synthetic Data, Multimodal Benchmark, Synthetic, +1

Seasonal Water Chemistry in the Swan River Estuary, Western Australia

Seasonal riverine discharge drives large intra-annual variations in temperature (13-29°C) and salinity (3-30) at two sites in the Swan River estuary. Anoxia in bottom waters associated with a salt wedge increased ammonium and phosphate concentrations, especially at the deeper site. The dataset, sourced from Geoscience Australia Data, examines major ions, nutrients, and chlorophyll a to assess nutrient limitations on phytoplankton growth.

Tabular, Time Series, Western Australia, Nutrient Cycling, Water Quality, Seasonal Variation, Estuarine Ecology, +1

WREED Project: Geochemical Analyses of REE Deposits in Mongolia and China

British Geological Survey data from the WREED project includes analyses of archive rock and soil samples from rare earth element deposits in Mongolia and China. The data characterizes mineralogy, bulk rock geochemistry, and sequential leaching experiments on laterite, weathered rock, and soil overlying carbonatite-related REE deposits. It was collected to determine enrichment and depletion of REE relative to bedrock, the mineral host of REE, and the ease of extraction.

Tabular, Time Series, Rare Earth Elements, Mineralogy, Sequential Leaching, Carbonatite Deposits, Geochemistry, +1

Optical Micrographs of Warton Slag Bank Samples from Northwest England (2022-2023)

Optical microscope photographs taken in 2022-2023 using transmitted and reflected light. The images show spatial textures and microstructures in samples from the Warton slag bank. The data were collected by John MacDonald and Robin Hilderman of the University of Glasgow and are held by the British Geological Survey.

Image, Optical microscopy, Microstructure, Geological Samples, Slag Analysis, Finance, +1

X-Ray Diffraction Mineralogy Data from UK Slag Banks (2022-2023)

Raw X-Ray Diffraction (XRD) analysis data from 2022-2023 for samples collected at four slag bank field locations in Scotland and northwest England: Warton, Glengarnock, Derwent Howe, and Harrington. The data were collected by John MacDonald and Robin Hilderman of the University of Glasgow to identify sample mineralogy. The dataset is hosted by the British Geological Survey (BGS).

Tabular, X Ray Diffraction, Mineralogy, Slag Banks, Geochemistry, +1

SGI-Bench: Scientist-Aligned LLM Evaluation Across 10 Disciplines

SGI-Bench is a scientist-aligned benchmark for evaluating Scientific General Intelligence in large language models. It spans 10 scientific disciplines and contains more than 1,000 expert-curated samples inspired by Science's 125 Big Questions. The dataset was created by InternScience and last updated on Hugging Face in June 2026.

Text, Multidisciplinary Science, Agentic Framework, Scientific Benchmark, Benchmark, Llm Evaluation, +1

DCLM Data 200M: Packed GPT-2 Token Sequences for Data-Constrained Pretraining

A dataset snapshot of pre-tokenized sequences used in the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'. The data consists of packed GPT-2-tokenized sequences derived from the DCLM corpus, prepared for studying pretraining in data-constrained, compute-rich regimes. The snapshot was uploaded by author zhiwei555 to Hugging Face.

Text, Language Modeling, Pretraining, Natural Language Processing, Gpt 2 Tokens, Text Corpus, +1

4XMM-DR14s: XMM-Newton Stacked X-Ray Source Catalog with 427,524 Unique Sources

The XMM-Newton Serendipitous Source Catalog from Stacked Observations contains 427,524 unique X-ray sources, with 329,972 observed multiple times. Compiled by NASA from 10,336 overlapping XMM-Newton observations taken between 2000 and 2023, it provides source parameters like fluxes and hardness ratios derived from simultaneous fits. The catalog includes 1,807,316 individual flux measurements aimed at studying long-term variability of X-ray emitting sources.

Tabular, Time Series, Astronomy, Catalog, X Ray Astronomy, Space Telescope, +1

Rave: Jeppesen Programming Language Data in Alpaca Format

A dataset created using the Easy Dataset tool for streamlining fine-tuning datasets for Large Language Models. It contains data related to the Jeppesen programming language and is formatted according to the alpaca structure. The dataset was last updated on June 12, 2026.

Text, Programming Language, Alpaca Format, Jeppesen, Llm Fine Tuning, +1

Basement and crustal results from the Bremer Sub-basin, SW Australia and its Antarctic cou

Geoscience Australia's 2004 Southwest Frontiers Survey acquired 2700 km of industry-standard seismic data to study the continental margin. One key finding is that basement velocities are in the 5.2-5.6 km/s range, indicating a composition that is likely not granitic. Results from a conjugate site in Antarctica, obtained by a Russian expedition in 2004-2005, show consistently low velocities, suggesting a ~400 km wide zone in the pre-breakup Gondwana supercontinent.

Audio, Time Series, Geospatial, Seismic Survey, Geophysics, Hydrocarbon Exploration, Australia Antarctica, Crustal Structure, Synthetic, +1

4XMM-DR14s: XMM-Newton Stacked X-Ray Source Catalog, 2000-2023

The XMM-Newton Serendipitous Source Catalog from Stacked Observations (4XMM-DR14s) contains 427,524 unique X-ray sources compiled from 10,336 overlapping observations taken between 2000 and 2023. It includes 1,807,316 individual flux measurements, with source parameters like fluxes, hardness ratios, and variability information derived from simultaneous fits. The catalog was produced by the National Aeronautics and Space Administration (NASA) and aims to study the long-term behavior of X-ray emitting sources.

Tabular, Time Series, X Ray Sources, Astronomy, Space Science, Satellite Observations, +1

ViroBlend: A Small-Scale Mixed Pre-training Corpus for Genomics

ViroBlend is a 216 megabase pair mixed pre-training corpus introduced by YDXX. It combines broad genomic context with enriched viral signals using source-wise stratified sampling to balance three heterogeneous data sources. The dataset was last updated on 2026-05-29.

Text, Pre Training Corpus, Bioinformatics, Healthcare, Genomics, Natural Language Processing, Viral Sequences, +1
