Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,454 datasets
An inventory of public information generated, obtained, acquired, or controlled by the Mayor's Office of Castilla La Nueva, Colombia, that has been classified as confidential or reserved. The dataset includes 19 columns detailing the legal basis, responsible parties, document series, and classification dates. It is hosted on the Colombian open data portal www.datos.gov.co and was last updated on 2026-05-18.
A 2020 map from the Climate Impact Atlas identifies neighborhoods in Zuid-Holland province that are at least two degrees warmer due to the urban heat island effect. This dataset shows the percentage of vulnerable families with children (defined as having low education, low income, or no employment) living in those warmest areas. It was published by the Dutch Ministry of the Interior and Kingdom Relations under a CC-PDM-1.0 license.
EvalSTT is a public evaluation corpus for speech-to-text models focusing on French administrative language. Created by the French government's DINUM AI department, it contains official speeches, public addresses, and parliamentary questions. The dataset is published for transparency to document and reproduce the government's model evaluation benchmarks.
Índice de Información Clasificada y Reservada de la Procuraduría General de la Nación is an inventory of information generated, obtained, acquired, or controlled by Colombia's Office of the Inspector General that has been classified as confidential or reserved under the legal framework. The dataset includes 22 columns detailing the legal basis, responsible departments, storage format, and classification terms for each record. It is hosted on the Colombian open data portal, datos.gov.co, and was last updated in May 2026.
A publication schema from the General Comptroller's Office of the Municipality of Manizales, Colombia, detailing its proactive information disclosure. The dataset includes 11 columns describing information titles, responsible parties, formats, and publication logistics. It was last updated on 2026-05-18.
138.5 MB of original TCGA LGG source files used to build a multi-omics relational database. The unmodified TXT files include clinical information, survival outcomes, mutation data, copy number alterations, and mRNA expression data. Author Aaliah Aly uploaded these files to figshare in May 2026 to support transparency and reproducibility.
ÍNDICE DE INFORMACIÓN CLASIFICADA Y RESERVADA is an inventory of public information generated, obtained, acquired, or controlled by obligated entities that has been classified as confidential or reserved. The dataset is published by www.datos.gov.co and was last updated on 2026-05-18. It includes columns detailing the classification, legal justification, responsible parties, and publication status of the information.
An inventory of public information generated, obtained, acquired, or controlled by obligated entities in Colombia that has been classified as confidential or reserved. The dataset includes 20 columns detailing the title, description, legal justification, classification date, and responsible entity for each record. It is published by www.datos.gov.co and was last updated on 2026-05-18.
GPM Ground Validation NOAA Parsivel MC3E V1 contains processed meteorological data from a ground-based disdrometer. Collected during the Midlatitude Continental Convective Clouds Experiment in central Oklahoma, the dataset includes 1-minute resolution moment data and raindrop number concentration estimates from April 5 to June 6, 2011. It was produced by the GHRC DAAC to provide reference reflectivity for calibrating an S-band profiler.
Evaluation reports for Global Affairs Canada's priorities, programs, and projects in Colombia. The reports serve as a management tool for reviewing program performance and improving future design and implementation. The dataset consists of individual HTML reports generated from periodic evaluations.
Australian bathymetry data collected by Geoscience Australia and other agencies. The dataset combines measurements from satellite altimetry, singlebeam echosounders, multibeam echosounders, and airborne laser systems (LADS). It was last updated on 2026-05-05.
A report generated from a periodic evaluation of Global Affairs Canada's priorities, programs, and projects. The evaluation serves as a management tool for reviewing program performance, with gathered information intended to improve the design and implementation of upcoming initiatives. The report is published by Global Affairs Canada and was last updated on 2026-05-28.
An inventory of public information generated, obtained, acquired, or controlled by the Municipality of Fusagasugá, Colombia, that has been classified as confidential or reserved under Law 1712 of 2014. The dataset is structured using a template from MINTIC and was last updated on May 18, 2026. It is published by www.datos.gov.co.
A supplementary file from a study evaluating a multi-party conversational system for social robots. The system, implemented on a Furhat robot, combines multimodal perception with a large language model and was tested with 30 participants across two interaction scenarios. The PDF document reports results including addressee accuracy and face recognition reliability from experiments conducted by author Giulio Antonio Abbo.
OPUS Neapolitan Translations provides nearly 1 million parallel translation examples across Italian, English, and Neapolitan. The dataset was created by author Gdacciaro, starting from an OPUS English-Italian parallel corpus and generating Neapolitan translations using a translation model. It was last updated on June 14, 2026.
4.5 MB of data files, R scripts, and HTML files from a study on numeral acquisition in Dutch kindergartners with and without suspected Developmental Language Disorder (DLD). The collection includes CSV files for tasks like Rote Counting, Tell Me, and Give Me, with scores, accuracy, and response categorizations. The dataset was authored by H.M. de Vries and last updated on April 9, 2026.
VSTAT is a video-based benchmark for evaluating the visual state tracking capability of Multimodal Large Language Models (MLLMs). It contains 834 video clips paired with 1,500 questions whose answers cannot be inferred from any single keyframe or short segment. The dataset was created by nyu-visionx and was last updated in June 2026.
A simplified geospatial representation of the Braunschweig urban area and its surroundings. The data is provided by the City of Braunschweig under the Data License Germany - Attribution - Version 2.0. The dataset is aggregated by the Bundesamt für Kartographie und Geodäsie.
A dataset from figshare authored by Laura M. Vowels, last updated on 2026-04-27. It contains results from Study 2, which examined participants' perceptions of large language model (LLM)-generated responses for psychosocial risk assessment. The 9.5 KB Excel file likely contains ratings on accuracy, empathy, and clinical usefulness across risk domains like suicide, intimate partner violence, and substance misuse.
Global Affairs Canada periodically conducts evaluations of its priorities, programs, and projects. These evaluation reports serve as a practical management tool for reviewing program performance and improving future program design and implementation. The reports are published by Global Affairs Canada and were last updated in May 2026.