DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


NLP & Text Datasets | DataSalon


NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,721 datasets


VANTAGE: Generated Artifacts for Speculative Code-Edit Decoding

Generated artifacts for the VANTAGE research project on speculative decoding for code editing. The dataset stores repository-relative paths and artifacts used by the paper and summarization scripts. It was created by faizancodes and last updated on June 2, 2026.

Text, Artifacts, NLP Research, Code Generation, Speculative Decoding, Synthetic


Active Information Registry of Córdoba Department, 2020

A registry of information assets from the Comptroller General of the Department of Córdoba, Colombia, for the 2020 fiscal year. The dataset is published by www.datos.gov.co and was last updated on 2026-05-18. It includes 12 columns describing the assets, such as their format, location, and classification.

Tabular, CSV, XML, JSON, Information Assets, Government Information, Public Records, Colombia


Hospital Quality Indicators for 2019, Monthly and Semiannual Metrics

Indicadores de Calidad (Enero - Diciembre 2019) contains all quality indicators for 2019 from the E.S.E Hospital Nuestra Señora de la Candelaria. The dataset is hosted on the Socrata platform via www.datos.gov.co and was last updated on 2026-05-18. The column names indicate monthly, semiannual, and annual performance figures measured against set targets.

Tabular, CSV, XML, JSON, Hospital Quality, Colombia Health, Performance Indicators


B-Plan 1-045: Freiburg im Breisgau Development Plan for Augustinerplatz

Freiburg im Breisgau's official development plan 1-045 for the Augustinerplatz area, provided as a Web Map Service (WMS). The dataset is published by the Bundesamt für Kartographie und Geodäsie. The last update date is unknown.

Geospatial, Freiburg, Development Plan, Urban Planning


Historic Environment Event Record: Archaeological Interventions in Cornwall and Scilly

Historic Environment Records from Cornwall and Scilly document archaeological and historic building interventions, termed 'Events'. These records, often linked to planning conditions or academic research, are used to update the regional Historic Buildings, Sites and Monuments Record (HBSMR) and are contributed to the national OASIS project and Archaeological Data Service. The data is provided by the Government Digital Service under an Open Government Licence.

Text, Grey Literature, Archaeology, Historic Environment, Fieldwork, Cornwall


York Council Plan Consultation Responses from September-October 2015

York Council conducted a public consultation between September and October 2015 to gather resident priorities for its Council Plan. Responses were collected via drop-in sessions at West Offices, an online survey, and questionnaires sent to partners and businesses. The published responses have had personal identifiers redacted to comply with data protection requirements.

Text, Policy Feedback, Local Government, Public Consultation, Civic Engagement


Tasmanian Plutonic Rock Zircon Ages from SHRIMP Analysis, 2012-2013

Six new zircon U-Pb geochronological data points obtained via Sensitive High-Resolution Ion Micro Probe (SHRIMP) from plutonic igneous rocks in Tasmania. The data were collected between July 2012 and June 2013 by the collaborative Geochronology Project between Mineral Resources Tasmania and Geoscience Australia. Five samples are from the Eastern Tasmanian Terrane and one from the Western Tasmanian Terrane.

Tabular, Tasmania Geology, Geochronology, Geoscience Australia, Finance, U-Pb Zircon, Plutonic Rocks


GBR1_H2p0_B3p2_Cfur_Dnrt: Retired Great Barrier Reef Biogeochemistry Model Results

Version 3.2 of the 1km-resolution regional-scale biogeochemistry and sediments model for the Great Barrier Reef, forced by a 1km hydrodynamic model. The dataset was retired by its authors in February 2026 due to an error causing unrealistic Chlorophyll-a levels. The model ran in near-real-time mode, updating daily, until January 2024 when sensor damage halted river-flow data input.

Time Series, Geospatial, Great Barrier Reef, Sediments, Ocean Modeling, Biogeochemistry


The Ten Mu‘allaqat: Annotated Classical Arabic Poems for NLP

Manually collected from the book Fath Al-Kabir Al-Muta‘al fi I‘rab Al-Mu‘allaqat Al-‘Ashr Al-Tiwal, this dataset provides detailed linguistic and semantic annotations for the complete Ten Mu‘allaqat poems. Each entry represents a single verse and includes fields for poet name, verse text, vocabulary explanation, meaning, and grammatical analysis. The dataset was created by SarahALo and last updated on Hugging Face in May 2026 to support Arabic Natural Language Processing and educational applications.

Text, Classical Literature, Arabic Poetry, Natural Language Processing, Linguistic Annotation, Grammatical Analysis


Waste Infrastructure Report and Maps 2010

Data from March 2010 details permitted waste management sites in England and Wales. It combines standard permitting system fields with additional information from permits, re-categorizing sites into more helpful categories. The dataset includes permit references, operator names, site locations, permitted throughput, and activity descriptions.

Tabular, Geospatial, ZIP, Environment Agency, Waste Infrastructure, Waste Management, Permitting, Infrastructure, Environment


UK Waste Infrastructure Data Tables for 2010

Environment Agency waste permitting data at the end of March 2010. The dataset brings together standard permitting fields with additional information from permits and re-categorizes sites into more helpful categories. It includes details such as Permit Reference, Operator Name, Site Location, Maximum permitted throughput, and activity descriptions.

Tabular, Geospatial, 🇬🇧 United Kingdom, Excel, Waste Management, Environmental Permitting, Infrastructure


SAE-LWIR

SAE-LWIR is the first publicly available dataset generated with MODTRAN for atmospheric compensation in standoff long-wave infrared hyperspectral imaging. The dataset supports the paper 'Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging' presented at IGARSS 2026. It was created by researchers from the Universidad Industrial de Santander in Bucaramanga, Colombia.

Image, Geospatial, LWIR, Atmospheric Compensation, Hyperspectral Imagery, MODTRAN, Synthetic


Second Language Emotion Regulation Questionnaire with Longitudinal Validation

811 Chinese tertiary EFL learners provided two waves of data for validating the L2 Emotion Regulation Strategies Questionnaire (L2ERSQ). Huiyuan Gu created this domain-specific instrument, which demonstrates a confirmed 7-factor structure and longitudinal measurement invariance. The dataset, last updated in 2026, supports research on emotion regulation in second language acquisition.

Tabular, Excel, Questionnaire Validation, Psychometrics, Emotion Regulation, Longitudinal Invariance, Second Language Learning


Measures and Items from a Retail Study on Fixture Crowding and Shopping Aids

A 5.5 KB Excel file containing measures and items used in a multi-study research project on retail design. The dataset, authored by Mathias C. Streicher and last updated in April 2026, examines how in-aisle fixtures and shopping aids like carts influence purchasing behavior through spatial crowding and perceived control.

Tabular, Excel, Shopping Aids, Retail Research, Spatial Crowding, Consumer Behavior, Field Experiment


Audited Public Entities in Antioquia, Colombia

A list of public entities in Antioquia, Colombia, that are subject to audit by the General Comptroller's Office of Antioquia. The dataset includes entity names, locations, and identification codes. It is published by datos.gov.co and was last updated on 2026-05-18.

Tabular, CSV, XML, JSON, Colombia, Administrative Data, Public Audit, Government Entities


Great Cumbung Swamp Geomorphological Study

A 2026 study by the Australian Ocean Data Network details the fluvial deposition and geomorphology of the Lachlan River terminus. It describes three distinct depositional environments within the swamp: the sinuous Lachlan channel, the extensive Phragmites Marsh, and surrounding overflow areas. The analysis focuses on the river's low-gradient termination and sediment characteristics.

Text, Geospatial, Australian Hydrology, River Terminus, Fluvial Sedimentology, Wetland Geomorphology, Finance, Large Scale


Great Cumbung Swamp Fluvial Geomorphology Study

Eastern Australia's Great Cumbung Swamp, the terminus of the low-gradient Lachlan River, is documented in this scientific description. The Australian Ocean Data Network provides details on three distinct depositional environments: the Lachlan channel, Phragmites Marsh, and overflow areas. The record was last updated in April 2026.

Text, Geospatial, Australian Hydrology, River Terminus, Fluvial Sedimentology, Wetland Geomorphology, Finance, Large Scale


Language Decoded: Multilingual Python Code Datasets for Model Training

Language Decoded Data is a multilingual code dataset for the Language Decoded project, part of Cohere's research. The dataset includes configurations for Phase 3 with sizes of 103k, 20k, and 5k rows for Conditions 1 and 2, and Phase 2 configurations remain available for reproducibility. It was last updated by user 'legesher' on Hugging Face on 2026-05-31.

Text, Multilingual, Python, NLP Training, Multilingual Code, Programming Languages


Experimental Data on Mn(II) Chlorides for Thermal Quenching and Information Encryption

5.4 MB of source data from a study on reversible phase transformations in manganese(II) chlorides. The data, authored by Aibo Li and shared under a CC-BY-4.0 license, supports findings on thermal quenching for high-precision information encryption and thermal energy storage applications. It was last updated on May 26, 2026.

Tabular, Excel, Thermal Energy Storage, Phase Transformation, Information Encryption, Materials Science


Evaluation of CIDA's Senegal Program: Government Reports on International Development

Global Affairs Canada periodically conducts evaluations to review the performance of its programs and projects. The information gathered helps improve the design and implementation of upcoming international development initiatives. Each evaluation results in a report. The dataset was last updated on 2026-05-28.

Text, Benchmark, International Development, Program Evaluation, Government Reports, Synthetic, Senegal
