DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,462 datasets

Milky Way Project IR Bubble Catalog: 5,106 Crowdsourced Star Formation Sites

A 2013 catalog of 5,106 infrared bubbles in the Milky Way, created by NASA HEASARC based on citizen scientist classifications from The Milky Way Project. The catalog provides consensus parameters for bubble positions, radii, thicknesses, eccentricities, and position angles, with each object measured by at least five individuals. This first data release includes bubbles that rediscover 86% of objects from three prior catalogs and identifies 29% of bubbles as nested or rim-associated.

Tabular, Infrared Bubbles, Citizen Science, Star Formation, Astronomy, Milky Way, +1

Envigado Personería Publication Schema: Information Catalog for Proactive Disclosure

ESQUEMA DE PUBLICACIÓN DE INFORMACIÓN, PERSONERÍA DE ENVIGADO is a structured catalog from the Colombian open data platform www.datos.gov.co. It describes information published and to be published by the obligated entity, in accordance with proactive disclosure principles under Law 1712 of 2014. The dataset was last updated on 2026-05-18 and includes 15 columns detailing format, responsible area, description, and access methods.

Tabular, CSV, XML, JSON, Government Metadata, Proactive Disclosure, Information Schema, Publication Catalog, +1

Experimental Data on AI Automation and Creative Agency in Music Co-Creation

A dataset from a 2026 figshare study by Wenting He investigating the impact of generative AI on human creative agency. The data likely contains results from a between-subjects experiment with 162 participants who completed a music co-creation task. It examines how AI automation level affects subjective task load, psychological ownership, and state sense of agency, moderated by musical expertise.

Tabular, Audio, Generative Ai, Human Ai Collaboration, Music Creation, Experimental Psychology, Creative Agency, Synthetic, +1

Índice de Información Clasificada y Reservada (Index of Classified and Reserved Information)

An inventory of public information generated, obtained, acquired, or controlled by the Institute for the Development of Antioquia (IDEA) that has been classified as confidential or reserved. The dataset includes 16 columns detailing the legal basis, responsible parties, formats, and classification terms for each record. It was last updated on 2026-05-18 and is hosted by the Colombian open data portal www.datos.gov.co.

Tabular, CSV, XML, JSON, Government Transparency, Information Classification, Public Administration, Legal Framework, +1

Modern Chinese to Lu Xun Style Dataset with 7,000 Text Pairs

7,000 Chinese text pairs for modern Chinese to Lu Xun style rewriting. The modern Chinese source side was generated by DeepSeek V4 Flash through an API-based modernization pipeline, while the target side contains Lu Xun style Chinese text. The dataset was created by liuyanliang and last updated on Hugging Face in June 2026.

Text, Literary Style, Text Generation, Style Transfer, Chinese Language, Natural Language Processing, Synthetic, +1

Frames in Groups: Media Discourse on Ukraine Migration from Top Reposts

200 social media posts represent the top 20 most reposted items each month over a 10-month period. The data is annotated with 5 generic and 5 issue-specific frames, such as Conflict and Migration Flows, across four political groups. Author Tomasz Piróg released this dataset under a CC-BY-4.0 license on figshare.

Tabular, Excel, Finance, Media Frames, Political discourse, Social Media Analysis, Ukraine Migration, +1

Sogamoso Chamber of Commerce Index of Classified and Reserved Information

An inventory of public information generated or controlled by the Sogamoso Chamber of Commerce that has been classified as confidential or reserved. The dataset includes 13 columns detailing the content, legal basis, responsible parties, and classification terms for each record. It is published by the Colombian open data portal, www.datos.gov.co, and was last updated in May 2026.

Tabular, CSV, XML, JSON, Government Transparency, Public Information, Document Classification, Administrative Records, +1

ResearchMath-Reasoning-194K: Model-Generated Solutions for Advanced Math Problems

193,938 long-form reasoning traces and solutions for research-level mathematical problems, released alongside ResearchMath-14k. The dataset contains model-generated solution attempts, each with a problem statement, a chain-of-thought reasoning trace, and a final response. It was authored by 'amphora' and last updated on Hugging Face in June 2026.

Text, Reasoning Traces, Mathematics, Chain Of Thought, Ai Training, Research Level, Synthetic, +1

BMW-Chandra: Multi-scale Wavelet Catalog of 21,325 X-ray Sources

The Brera Multi-scale Wavelet Chandra Source Catalog (BMW-Chandra) contains 21,325 X-ray sources identified from 136 Chandra ACIS-I observations public as of March 2003. The NASA HEASARC created this table in September 2008 based on the CDS catalog J/A+A/488/1221, making it the largest compilation of Chandra sources at its publication date. It includes source positions, count rates in multiple energy bands, flux estimates, and cross-matches with other astronomical catalogs.

Tabular, Chandra, X Ray Sources, Wavelet Detection, Astrophysics Catalog, Astronomy, +1

Fermi GBM Trigger Catalog: Gamma-Ray Burst and Flash Observations

All triggers observed by the 14 detectors of the Fermi Gamma-ray Burst Monitor (GBM), including 12 sodium iodide and 2 bismuth germanate detectors. The catalog is automatically updated within about a day of data processing by NASA's HEASARC, with latency requirements of 1 day for triggers and 3 days for bursts. Data originates from the Fermi GBM Instrument Operations Center and Fermi Science Support Center, provided as FITS files.

Tabular, Time Series, Satellite Data, Space Missions, Gamma Ray Astronomy, Astrophysics, +1

Cauca Chamber of Commerce Public Information Asset Registry

REGISTRO DE ACTIVOS DE INFORMACIÓN CÁMARA DE COMERCIO DEL CAUCA is a public information asset inventory from the Cauca Chamber of Commerce in Colombia. The dataset likely contains metadata about public information generated or controlled by the Chamber, including its format, language, and category. It was last updated on 2026-05-18.

Tabular, CSV, XML, JSON, Government Data, Colombia, Government Metadata, Public Information Registry, Metadata Catalog, Data Catalog, +1

Spatial Interpolation Accuracy for Australian Seabed Sediments Across Coordinate Systems

Geoscience Australia data examines the effects of eight spatial reference systems on the predictive accuracy of spatial interpolation methods for seabed sediments. The study applied inverse distance squared and ordinary kriging to marine data within the Australian Exclusive Economic Zone, assessing accuracy via cross-validation and map visualization. Results indicate negligible differences in predictive accuracy between the tested geographic coordinate systems and map projections.

Geospatial, Coordinate Systems, Spatial Interpolation, Finance, Geospatial Analysis, Australia Eez, Marine Sediments, +1

Connecticut Hazardous Waste Manifests 1984-2008

Over 100,000 paper manifests were received annually, detailing hazardous waste shipments within Connecticut. The dataset includes generator, transporter, and treatment facility information, compiled by the Connecticut Department of Energy and Environmental Protection. Records span from 1984 to 2008.

Tabular, CSV, XML, JSON, Connecticut, Rcra, Hazardous Waste, Generator, Waste Management, Manifest, Synthetic, +1

Pereira Pedestrian Network: Urban Connectivity and Public Space Access

Red Peatonal Pereira is a dataset describing the pedestrian network of the city of Pereira, Colombia, sourced from www.datos.gov.co. The network is intended to connect the urban territory, making communication nodes, facilities, and public spaces accessible to citizens traveling on foot. The dataset was last updated on 2026-05-18 18:28:19.

Tabular, Geospatial, CSV, XML, JSON, City Planning, Transportation, Urban Infrastructure, Pedestrian Network, +1

Net Changes in Phyllostomid Bat Genera Distribution Under Climate Scenarios

Net changes in the distribution areas of phyllostomid genera in the Neotropics are reported under different climate change scenarios for 2040. The dataset was authored by Daryl Cruz and published on figshare under a CC-BY-4.0 license. It was last updated on May 22, 2026.

Tabular, Geospatial, CSV, Neotropics, Species Distribution, BATS, Phyllostomidae, +1

KernelBench-Hard: Frontier AI Model GPU Kernel Submissions

June 2026 submissions from 8 frontier coding models, including Claude Opus 4.8 and GPT-5.5, autonomously writing CUDA/Triton GPU kernels. Each model had one unlimited-time run per problem to write the fastest kernel for an NVIDIA RTX PRO 6000 Blackwell GPU, graded as peak_fraction of the hardware roofline. The dataset was created by Infatoshi and hosted on Hugging Face.

Tabular, Triton, Cuda, Performance Benchmark, Ai Agents, Gpu Kernels, +1

Smartwatch vs Self-Reported Sleep Data from 130 Participants

A validation dataset comparing smartwatch-measured and self-reported sleep parameters from 130 participants over 841 sleep instances. The data was collected between November 2023 and June 2024 from participants wearing three generations of Garmin smartwatches. It was authored by Christina T. Saliba and shared under a CC-BY-4.0 license.

Tabular, Sleep Quality, Wearable Validation, Sleep Research, Healthcare, Health Technology, Smartwatch Data, +1

Trehalose Treatment Effects on Shine Muscat Fruit: Transcriptomic and Metabolomic Data

Yuanxin Cheng's dataset contains results from a study on trehalose enhancing postharvest Shine Muscat fruit resistance to gray mold (Botrytis cinerea). The data includes 2201 differentially expressed genes and 383 differentially expressed metabolites identified through comparative omics analyses. The dataset was last updated on 2026-05-01 and is shared under a CC-BY-4.0 license on figshare.

Tabular, Excel, Transcriptomics, Plant Pathology, Healthcare, Botrytis cinerea, Postharvest Fruit, +1

Kimberley Marine Park 30 m Bathymetry and Seafloor Morphology

A 30-meter resolution bathymetric grid and derived morphological surfaces for Kimberley Marine Park in Australia's Commonwealth waters. The data was processed by Geoscience Australia using a two-part seafloor classification scheme that categorizes slope into Plains, Slopes, and Escarpments. This release supports the management of Australia's network of 58 marine parks covering 3.3 million square kilometres.

Geospatial, 🇦🇺 Australia, ZIP, Marine Park, Marine Bathymetry, Finance, Large Scale, +1

Police Patrol Applicant Pre-Registrations with Demographic and Geographic Data

Pre-registration data for the 2025 Police Patrol Officer recruitment call in Colombia, sourced from datos.gov.co. The dataset includes applicant demographics such as marital status, gender, academic level, and geographic location. It was last updated on 2026-05-18.

Tabular, Geospatial, CSV, XML, JSON, Applicant Demographics, Tabular Data, Police Recruitment, Geospatial Data, +1
