DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,451 datasets

Active Information Records CDA: Colombian Public Document Metadata

REGISTROS ACTIVOS DE INFORMACION CDA is a dataset from www.datos.gov.co, last updated on 2026-05-18. It contains metadata for active information records managed under Colombia's General Archive Law 594 of 2000. The data includes columns such as SERIE DOCUMENTAL, Dependencia responsable, Formato, and Descripción del contenido.

Tabular, CSV, XML, JSON, Document Management, Public Records, Colombia, +1 more

Older Adults' Satisfaction with Generative AI Conversational Agents

A study dataset examining how task–technology fit and social–technology fit shape older adults’ satisfaction with generative AI-powered conversational agents. The dataset was authored by Mingxi Sun and last updated on 2026-05-26. It is a small dataset of 148.6 KB, stored in an XLSX file.

Tabular, Excel, Conversational Agents, Generative Ai, Older Adults, User Satisfaction, Synthetic, Human Computer Interaction, +1 more

ACTGOV Urban Open Space Asset Locations and Attributes

A polygon dataset showing land owned or managed by the City and Environment directorate for urban open space in the Australian Capital Territory. It includes attributes such as asset name, suburb, ownership, asset sub-type, and land area, and is maintained by City Services. The dataset was last updated on 2026-04-04.

Geospatial, ZIP, CSV, Text, Excel, Native Grassland Site, Pedestrian Parkland, Neighbourhood Park, Gdc0b1360df, Open Space, District Park, Community Facilities And Assets, Learn To Ride Facility, Ced, City And Environment, City Services, Land Management, Urban Open Space, Laneway, Land Cover, Crip, Park, Land Administration, Community Recreational Irrigated Park, Act Government, Geospatial Assets, Asset, +1 more

ACT Stormwater Sump Locations with Asset Attributes

Stormwater sumps in the Australian Capital Territory are mapped with attributes like ownership, lid material, and depth. Assets are owned or managed by the City and Environment Directorate and captured via a works-as-executed handover process. The dataset includes 16 distinct asset sub-types, such as Grated Sump and Inspection Pit.

Geospatial, ZIP, CSV, Text, Excel, Gdc0b1360df, Water management, Community Facilities And Assets, Water, Ced, Sump, Stormwater Pits, Urban Planning, Stormwater Infrastructure, Asset, Stormwater, +1 more

TECCI: Tricky Edits of Collected and Curated Images for Instruction Following

TECCI provides 1,934 images paired with 7,550 edit instructions for evaluating multimodal models. The dataset includes two subsets: TECCI-GGIS with 1,404 images and 7,020 automatically generated instructions, and TECCI-IRCS with 530 images and 530 manually written instructions. Created by Google and last updated in May 2026, it is hosted on Hugging Face.

Image, Multimodal, Computer Vision, Instruction Following, Multimodal Evaluation, Synthetic, +1 more

Elective Medical Consultations in Betania, Antioquia from 2013

Elective consultation records from the ESE Hospital San Antonio in Betania, Antioquia, Colombia, starting in 2013. The data excludes first-time general medicine and emergency general medicine visits. It was published by www.datos.gov.co and was last updated in May 2026.

Tabular, CSV, XML, JSON, Hospital Records, Colombia Health, Administrative Data, Healthcare Services, Elective Consultations, +1 more

Plastome and nrITS Sequences for Marsdenieae Tribe Phylogenetic Analysis

20 newly sequenced plastomes and 31 newly sequenced nrITS sequences from the Marsdenieae plant tribe, assembled from Illumina NovaSeq 6000 reads. The dataset includes additional sequences assembled from eight publicly available SRA accessions. Rong Chen published this data on figshare in May 2026 under a CC-BY-4.0 license.

Text, Plastome Sequences, Plant genomics, Healthcare, Phylogenomics, Apocynaceae, Marsdenieae, Synthetic, +1 more

Devirt Corpus: Obfuscated JavaScript with Deobfuscation Metrics

A corpus of obfuscated JavaScript samples paired with their deobfuscated outputs and readability scores. Each sample includes metrics on input and output size, and the percentage of original code kept after deobfuscation. The dataset is authored by devirt-dev and was last updated on Hugging Face in June 2026.

Tabular, Readability Metrics, Code Analysis, Deobfuscation, Natural Language Processing, Javascript, Obfuscation, Synthetic, +1 more

Economic Reintegration Benefits in Colombia: National and Regional Disbursements

ESTADÍSTICAS DE LOS BENEFICIOS DE INSERCIÓN ECONÓMICA tracks national and regional disbursements of Economic Reintegration Benefits (BIE) in Colombia. The data, sourced from www.datos.gov.co, includes details on benefit types, municipalities, departments, and beneficiary status. The dataset was last updated on 2026-05-18.

Tabular, CSV, XML, JSON, Government Benefits, Colombia, Regional Statistics, Economic Reintegration, +1 more

Environmental Glossary with Terms and Definitions in Spanish

GLOSARIO AMBIENTAL is a Spanish-language glossary of key environmental terms. The dataset is hosted by www.datos.gov.co and was last updated on May 18, 2026. It provides simple and clear explanations for students, professionals, and anyone interested in ecological and conservation topics.

Text, CSV, XML, JSON, Spanish-language, Education, Environmental Glossary, Terminology, +1 more

Metrolinea Information Assets Registry: Public Data Inventory

METROLÍNEA S.A.'s inventory of public information assets, containing categories, descriptions, and formats. The registry includes columns for the elaboration date (FECHA DE ELABORACIÓN), version (VERSIÓN 1.0.), and the responsible organization. It is hosted by the Colombian open data portal www.datos.gov.co and was last updated on 2026-05-18.

Tabular, CSV, XML, JSON, Information Assets, Colombia, Metadata Catalog, Public Administration, +1 more

Free Public WiFi Usage and Connectivity Indicators for Risaralda, Colombia

Statistical and geographic data on the usage and operation of free WiFi zones in the Risaralda department. The dataset includes indicators related to connectivity, number of connections, usage time, visited zones, and general service behavior. It is hosted on the Colombian open data portal www.datos.gov.co and was last updated on May 20, 2026.

Tabular, Geospatial, CSV, XML, JSON, Public Wifi, Usage Statistics, Connectivity Metrics, Municipal Data, +1 more

Translation Performance Data for 100 EFL Learners in China

106 English as a foreign language learners from diverse disciplines in China participated in a study on task complexity and translation anxiety, with 100 included in final analyses. The dataset contains performance metrics from two written translation tasks at different complexity levels, assessed for process efficiency and product quality. The data was published by Xiangyan Zhou on figshare under a CC-BY-4.0 license.

Tabular, Excel, Task Complexity, Translation Anxiety, Translation Performance, Efl Learners, +1 more

Public Wi-Fi Zones in San José de Cúcuta Municipality with Location and Speed Data

A dataset of public Wi-Fi access points in the municipality of San José de Cúcuta, Colombia. The data includes geographic coordinates, connection speeds, technology types, and operational schedules. It is hosted on the Colombian open data portal www.datos.gov.co and was last updated in May 2026.

Tabular, Geospatial, CSV, XML, JSON, Geospatial Locations, Public Wifi, Municipal Infrastructure, Colombia, Connectivity, +1 more

BrokenArXiv Training: Research-Level Mathematical Problems from ArXiv

Training data generated from past ArXiv articles using the BrokenArXiv pipeline. The dataset is intended for training models on research-level mathematical problems and is licensed under cc-by-4.0, though individual rows may have different licenses. The dataset was created by MathArena and was last updated on 2026-06-16.

Text, Mathematics, Nlp Training, Arxiv, Research Papers, Synthetic, +1 more

ArXiv Math Training Data for Research-Level Problem Solving

Generated from past ArXiv articles, this dataset provides training data for research-level mathematical problems. It was created by MathArena and last updated on June 16, 2026. Each row has a different license depending on the source article, which downstream users must respect.

Text, Mathematics, Language Model Training, Academic Text, Research Papers, Synthetic, +1 more

Indic HPLT v2: Multilingual Pretraining Corpus Across 13 Indic Languages

A multilingual pretraining corpus of 34,605,630 documents across 13 Indic languages and English, built from HPLT Monolingual v3 high-quality web crawl data. It is the larger successor to Indic HPLT v1, adding 3 new Indic languages and containing approximately 25.5 billion estimated tokens. The dataset was authored by ashtok897 and last updated on Hugging Face in May 2026.

Text, Multilingual, Web Crawl, Pretraining Corpus, Natural Language Processing, Indic Languages, Multilingual Text, +1 more

BCAI Finance Kor Embedding Pair: Korean Financial Text Pairs for Model Training

45,589 Korean financial text pairs serve as training data for contrastive learning and as a retrieval corpus. The dataset is authored by BCCard and was last updated on June 11, 2026. It likely contains anchor and positive sentence pairs derived from financial documents.

MEN_INDICE_PARIDAD_POR_GENERO_MATRICULA

A dataset from Colombia's Ministry of National Education (MEN) containing gender parity indices and enrollment figures for preschool, basic, and secondary education for the year 2020. Data is disaggregated by certified territorial entities (ETCs) and their respective departments. The dataset includes 20 columns covering enrollment counts by gender and education level, as well as calculated parity indices.

Tabular, CSV, XML, JSON, Education Statistics, Colombia, Gender Parity, Administrative Data, Enrollment Data, +1 more

ROSAT PSPC Catalog of 203 X-Ray Selected Galaxy Clusters

A catalog of 203 clusters of galaxies serendipitously detected in 647 ROSAT PSPC high Galactic latitude pointings covering 158 square degrees. This database was created by the NASA HEASARC in December 2001 based on the CDS/ADC catalog J/ApJ/502/558/. The catalog lists X-ray fluxes, core radii, and spectroscopic redshifts for 73 clusters and photometric redshifts for the remainder.

Tabular, X-Ray, Astronomy, Rosat, Galaxy Clusters, Astrophysics, +1 more
