DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


NLP & Text Datasets | DataSalon

NLP & Text

Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora

44,454 datasets

Castilla La Nueva Index of Classified and Reserved Public Information

An inventory of public information generated, obtained, acquired, or controlled by the Mayor's Office of Castilla La Nueva, Colombia, that has been classified as confidential or reserved. The dataset includes 19 columns detailing the legal basis, responsible parties, document series, and classification dates. It is hosted on the Colombian open data portal www.datos.gov.co and was last updated on 2026-05-18.

Tabular, CSV, XML, JSON, Government Transparency, Colombia, Public Administration, Document Classification, +1

Vulnerable Families in Warmest Neighborhoods of Zuid-Holland Province (2020)

A 2020 map from the Climate Impact Atlas identifies neighborhoods in Zuid-Holland province that are at least two degrees warmer due to the urban heat island effect. This dataset shows the percentage of vulnerable families with children—defined as low-educated, low-income, or unemployed—living in those warmest areas. It was published by the Dutch Ministry of the Interior and Kingdom Relations under a CC-PDM-1.0 license.

Geospatial, Climate Vulnerability, Urban Heat Island, Geospatial Analysis, Socioeconomic Indicators, +1

EvalSTT: French Government Speech Corpus for Speech-to-Text Model Evaluation

EvalSTT is a public evaluation corpus for speech-to-text models focusing on French administrative language. Created by the French government's DINUM AI department, it contains official speeches, public addresses, and parliamentary questions. The dataset is published for transparency to document and reproduce the government's model evaluation benchmarks.

Text, Audio, Speech-to-Text, French Language, Evaluation Corpus, Natural Language Processing, Government Speech, NLP Benchmark, +1

Colombian Attorney General's Index of Classified and Reserved Information

Índice de Información Clasificada y Reservada de la Procuraduría General de la Nación is an inventory of information generated, obtained, acquired, or controlled by Colombia's Attorney General's Office that has been classified as confidential or reserved under the legal framework. The dataset includes 22 columns detailing the legal basis, responsible departments, storage format, and classification terms for each record. It is hosted on the Colombian open data portal, datos.gov.co, and was last updated in May 2026.

Tabular, CSV, XML, JSON, Government Transparency, Legal Classification, Document Inventory, Colombia, +1

Manizales Municipal Comptroller's Office Publication Schema

A publication schema from the General Comptroller's Office of the Municipality of Manizales, Colombia, detailing its proactive information disclosure. The dataset includes 11 columns describing information titles, responsible parties, formats, and publication logistics. It was last updated on 2026-05-18.

Tabular, CSV, XML, JSON, Information Management, Government Transparency, Public Records, Colombia, Municipal Data, +1

TCGA LGG: Original Source Files for Multi-Omics Database Construction

138.5 MB of original TCGA LGG source files used to build a multi-omics relational database. The unmodified TXT files include clinical information, survival outcomes, mutation data, copy number alterations, and mRNA expression data. Author Aaliah Aly uploaded these files to figshare in May 2026 to support transparency and reproducibility.

Text, Tabular, Multi-Omics, TCGA, Healthcare, Genomics, Clinical Data, LGG, +1

Index of Classified and Reserved Public Information Records

ÍNDICE DE INFORMACIÓN CLASIFICADA Y RESERVADA is an inventory of public information generated, obtained, acquired, or controlled by obligated entities that has been classified as confidential or reserved. The dataset is published by www.datos.gov.co and was last updated on 2026-05-18. It includes columns detailing the classification, legal justification, responsible parties, and publication status of the information.

Tabular, CSV, XML, JSON, Government Transparency, Public Records, Colombia, Information Classification, Legal Compliance, +1

Index of Classified and Reserved Information from Colombian Public Entities

An inventory of public information generated, obtained, acquired, or controlled by obligated entities in Colombia that has been classified as confidential or reserved. The dataset includes 20 columns detailing the title, description, legal justification, classification date, and responsible entity for each record. It is published by www.datos.gov.co and was last updated on 2026-05-18.

Tabular, CSV, XML, JSON, Government Transparency, Public Records, Information Classification, Legal Compliance, +1

GPM Ground Validation Parsivel Raindrop Data from MC3E

GPM Ground Validation NOAA Parsivel MC3E V1 contains processed meteorological data from a ground-based disdrometer. Collected during the Midlatitude Continental Convective Clouds Experiment in central Oklahoma, the dataset includes 1-minute resolution moment data and raindrop number concentration estimates from April 5 to June 6, 2011. It was produced by the GHRC DAAC to provide reference reflectivity for calibrating an S-band profiler.

Tabular, Time Series, Raindrop Size, Atmospheric Science, Healthcare, Ground Validation, Precipitation, Synthetic, +1

Colombia Country Program Evaluation Reports by Global Affairs Canada

Evaluation reports for Global Affairs Canada's priorities, programs, and projects in Colombia. The reports serve as a management tool for reviewing program performance and improving future design and implementation. The dataset consists of individual HTML reports generated from periodic evaluations.

Text, Colombia, Benchmark, International Development, Program Evaluation, Government Reports, Synthetic, +1

Australian Marine Bathymetry Collection from Multiple Sources

Australian bathymetry data collected by Geoscience Australia and other agencies. The dataset combines measurements from satellite altimetry, singlebeam echosounders, multibeam echosounders, and airborne laser systems (LADS). It was last updated on 2026-05-05.

Geospatial, Ocean Depth, Marine Bathymetry, +1

Formative Evaluation of the Partnership for Gender Equality by Global Affairs Canada

A report generated from a periodic evaluation of Global Affairs Canada's priorities, programs, and projects. The evaluation serves as a management tool for reviewing program performance, with gathered information intended to improve the design and implementation of upcoming initiatives. The report is published by Global Affairs Canada and was last updated on 2026-05-28.

Text, 🇨🇦 Canada, Benchmark, Program Evaluation, Government Policy, Gender Equality, Synthetic, +1

Fusagasugá Municipality Index of Classified and Reserved Information

An inventory of public information generated, obtained, acquired, or controlled by the Municipality of Fusagasugá, Colombia, that has been classified as confidential or reserved under Law 1712 of 2014. The dataset is structured using a template from MINTIC and was last updated on May 18, 2026. It is published by www.datos.gov.co.

Tabular, CSV, XML, JSON, Government Transparency, Public Records, Colombia, Document Classification, +1

Multi-Party Open-Ended Conversation Data from a Social Robot System

A supplementary file from a study evaluating a multi-party conversational system for social robots. The system, implemented on a Furhat robot, combines multimodal perception with a large language model and was tested with 30 participants across two interaction scenarios. The PDF document reports results including addressee accuracy and face recognition reliability from experiments conducted by author Giulio Antonio Abbo.

Audio, Multimodal, Multimodal Perception, Social Robotics, Conversational AI, +1

OPUS Neapolitan Translations: Nearly 1 Million Italian-English-Neapolitan Sentence Pairs

OPUS Neapolitan Translations provides nearly 1 million parallel translation examples across Italian, English, and Neapolitan. The dataset was created by author Gdacciaro, starting from an OPUS English-Italian parallel corpus and generating Neapolitan translations using a translation model. It was last updated on June 14, 2026.

Text, Machine Translation, Neapolitan Language, Italian Language, English Language, Large Scale, Natural Language Processing, Parallel Corpus, Synthetic, +1

Dutch Kindergarteners' Numeral Acquisition and Morphosyntactic Cues with DLD

4.5 MB of data files, R scripts, and HTML files from a study on numeral acquisition in Dutch kindergartners with and without suspected Developmental Language Disorder (DLD). The collection includes CSV files for tasks like Rote Counting, Tell Me, and Give Me, with scores, accuracy, and response categorizations. The dataset was authored by H.M. de Vries and last updated on April 9, 2026.

Tabular, CSV, Excel, Psycholinguistics, Numeracy, Developmental Disorders, Language Acquisition, Dutch, +1

VSTAT: Visual State Tracking Benchmark for MLLMs

VSTAT is a video-based benchmark for evaluating the visual state tracking capability of Multimodal Large Language Models (MLLMs). It contains 834 video clips paired with 1,500 questions whose answers cannot be inferred from any single keyframe or short segment. The dataset was created by nyu-visionx and was last updated in June 2026.

Video, Multimodal, MLLM Evaluation, Multimodal AI, Benchmark, Video Benchmark, Synthetic, +1

Braunschweig Regional Map at 1:100,000 Scale

This geospatial dataset provides a simplified representation of the Braunschweig urban area and its surroundings. The data is provided by the City of Braunschweig under the Data License Germany - Attribution - Version 2.0 and is aggregated by the Bundesamt für Kartographie und Geodäsie.

Geospatial, Regional Map, Road Network, Urban Planning, +1

Study 2 Results: Participant Perceptions of LLM-Generated Psychosocial Risk Responses

A dataset from figshare authored by Laura M. Vowels, last updated on 2026-04-27. It contains results from Study 2, which examined participants' perceptions of large language model (LLM)-generated responses for psychosocial risk assessment. The 9.5 KB Excel file likely contains ratings on accuracy, empathy, and clinical usefulness across risk domains like suicide, intimate partner violence, and substance misuse.

Tabular, Excel, Mental Health, Clinical NLP, Benchmark, LLM Evaluation, Healthcare, Psychosocial Risk, Synthetic, +1

Global Affairs Canada Program Evaluation Reports

Global Affairs Canada periodically conducts evaluations of its priorities, programs, and projects. These evaluation reports serve as a practical management tool for reviewing program performance and improving future program design and implementation. The reports are published by Global Affairs Canada and were last updated in May 2026.

Text, 🇨🇦 Canada, Benchmark, Policy Analysis, Program Evaluation, Synthetic, +1
