Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,451 datasets
REGISTROS ACTIVOS DE INFORMACION CDA is a dataset from www.datos.gov.co, last updated on 2026-05-18. It contains metadata for active information records managed under Colombia's General Archive Law 594 of 2000. The data includes columns such as SERIE DOCUMENTAL, Dependencia responsable, Formato, and Descripción del contenido.
A study dataset examining how task–technology fit and social–technology fit shape older adults’ satisfaction with generative AI-powered conversational agents. The dataset was authored by Mingxi Sun and last updated on 2026-05-26. It is a small dataset of 148.6 KB, stored in an XLSX file.
A polygon dataset showing land owned or managed by the City and Environment Directorate for urban open space in the Australian Capital Territory. It includes attributes such as asset name, suburb, ownership, asset sub-type, and land area, and is maintained by City Services. The dataset was last updated on 2026-04-04.
Stormwater sumps in the Australian Capital Territory are mapped with attributes like ownership, lid material, and depth. Assets are owned or managed by the City and Environment Directorate and captured via a works-as-executed handover process. The dataset includes 16 distinct asset sub-types, such as Grated Sump and Inspection Pit.
TECCI provides 1,934 images paired with 7,550 edit instructions for evaluating multimodal models. The dataset includes two subsets: TECCI-GGIS with 1,404 images and 7,020 automatically generated instructions, and TECCI-IRCS with 530 images and 530 manually written instructions. Created by Google and last updated in May 2026, it is hosted on Hugging Face.
Elective consultation records from the ESE Hospital San Antonio in Betania, Antioquia, Colombia, starting in 2013. The data excludes first-time general medicine and emergency general medicine visits. It was published by www.datos.gov.co and was last updated in May 2026.
20 newly sequenced plastomes and 31 newly sequenced nrITS sequences from the Marsdenieae plant tribe, assembled from Illumina NovaSeq 6000 reads. The dataset includes additional sequences assembled from eight publicly available SRA accessions. Rong Chen published this data on figshare in May 2026 under a CC-BY-4.0 license.
A corpus of obfuscated JavaScript samples paired with their deobfuscated outputs and readability scores. Each sample includes metrics on input and output size, and the percentage of original code kept after deobfuscation. The dataset is authored by devirt-dev and was last updated on Hugging Face in June 2026.
ESTADÍSTICAS DE LOS BENEFICIOS DE INSERCIÓN ECONÓMICA tracks national and regional disbursements of Economic Reintegration Benefits (BIE) in Colombia. The data, sourced from www.datos.gov.co, includes details on benefit types, municipalities, departments, and beneficiary status. The dataset was last updated on 2026-05-18.
GLOSARIO AMBIENTAL is a Spanish-language glossary of key environmental terms. The dataset is hosted by www.datos.gov.co and was last updated on May 18, 2026. It provides simple and clear explanations for students, professionals, and anyone interested in ecological and conservation topics.
METROLÍNEA S.A.'s inventory of public information assets, containing categories, descriptions, and formats. The registry includes columns for the elaboration date (FECHA DE ELABORACIÓN), version (VERSIÓN 1.0.), and the responsible organization. It is hosted by the Colombian open data portal www.datos.gov.co and was last updated on 2026-05-18.
Statistical and geographic data on the usage and operation of free WiFi zones in the Risaralda department. The dataset includes indicators related to connectivity, number of connections, usage time, visited zones, and general service behavior. It is hosted on the Colombian open data portal www.datos.gov.co and was last updated on May 20, 2026.
A study of task complexity and translation anxiety among 106 English-as-a-foreign-language (EFL) learners from diverse disciplines in China, 100 of whom were included in the final analyses. The dataset contains performance metrics from two written translation tasks at different complexity levels, assessed for process efficiency and product quality. The data was published by Xiangyan Zhou on figshare under a CC-BY-4.0 license.
A dataset of public Wi-Fi access points in the municipality of San José de Cúcuta, Colombia. The data includes geographic coordinates, connection speeds, technology types, and operational schedules. It is hosted on the Colombian open data portal www.datos.gov.co and was last updated in May 2026.
Training data generated from past ArXiv articles using the BrokenArXiv pipeline. The dataset is intended for training models on research-level mathematical problems and is licensed under CC-BY-4.0, though individual rows may have different licenses. The dataset was created by MathArena and was last updated on 2026-06-16.
Generated from past ArXiv articles, this dataset provides training data for research-level mathematical problems. It was created by MathArena and last updated on June 16, 2026. Each row has a different license depending on the source article, which downstream users must respect.
A multilingual pretraining corpus of 34,605,630 documents across 13 Indic languages and English, built from HPLT Monolingual v3 high-quality web crawl data. It is the larger successor to Indic HPLT v1, adding 3 new Indic languages and containing approximately 25.5 billion estimated tokens. The dataset was authored by ashtok897 and last updated on Hugging Face in May 2026.
45,589 Korean financial text pairs serve as training data for contrastive learning and as a retrieval corpus. The dataset is authored by BCCard and was last updated on June 11, 2026. It likely contains anchor and positive sentence pairs derived from financial documents.
A dataset from Colombia's Ministry of National Education (MEN) containing gender parity indices and enrollment figures for preschool, basic, and secondary education for the year 2020. Data is disaggregated by certified territorial entities (ETCs) and their respective departments. The dataset includes 20 columns covering enrollment counts by gender and education level, as well as calculated parity indices.
A catalog of 203 clusters of galaxies serendipitously detected in 647 ROSAT PSPC high Galactic latitude pointings covering 158 square degrees. This database was created by the NASA HEASARC in December 2001 based on the CDS/ADC catalog J/ApJ/502/558/. The catalog lists X-ray fluxes, core radii, and spectroscopic redshifts for 73 clusters and photometric redshifts for the remainder.