Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,729 datasets
A municipal registry of informal vendors in Chía, Colombia, with data last updated on 2026-05-18. The dataset includes demographic and social vulnerability indicators such as age, gender, and disability status. It originates from the Colombian open data platform www.datos.gov.co.
Matriz Activos de Información Ministerio del Interior is an inventory of information generated, obtained, acquired, or transformed by the obligated entity. The dataset is hosted on the Colombian open data portal, datos.gov.co, and was last updated on 2026-05-18. It includes columns describing document series, formats, availability, and administrative dependencies.
Geological data for the Davenport province in central Australia, situated between the Tennant Creek and Arunta Inlier regions. The description details sedimentary, volcanic, and intrusive rocks, including the Warramunga Group (1870 Ma) and the Hatches Creek Group, which is at least 10 km thick. The dataset is provided by the Australian Ocean Data Network and was last updated in April 2026.
A central hub for training logs, configurations, and evaluation results from the Language Decoded project. The project originated from Cohere's Tiny Aya Expedition hackathon in March 2026 and was extended into Phase 3 for an accompanying paper submitted on 2026-05-26. The dataset serves as a record of experiments exploring the impact of fine-tuning a multilingual model on native-language code.
Vox Classica is a Latin speech corpus of approximately 73 hours of audio, segmented into short clips by sentence. It is a large-scale, ML-ready dataset of human-read Classical Latin designed to address the absence of a publicly available corpus large enough for model training. The dataset was curated by Kaiyuan Zhao and published by Ken-Z.
Weddell seal (Leptonychotes weddellii) and leopard seal (Hydrurga leptonyx) vocalizations were collected via passive acoustic monitoring 5.6 km seaward from Davis Station, East Antarctica. Eight-minute recordings were manually sampled hourly over 24 hours every 10 days from 24 July 2021 to 30 January 2022. The dataset likely contains daily call counts, showing seasonal and diel patterns influenced by ice cover and seal behavior.
Grant recipients and allocations under the Community Policing Partnership Program from 2016/17 until its end in March 2022. The program assisted municipalities and First Nations with salary-related costs for hiring additional police officers. The dataset is provided by the Government of Ontario.
A geospatial dataset from the French Bureau de Recherches Géologiques et Minières (BRGM) detailing urban planning zones and prescriptions for the commune of Avant-lès-Ramerupt, France. It likely contains zoning classifications and land-use prescriptions. The authoritative documents are available for consultation at local town halls and competent administrative offices.
Virtual training records from Colombia's National Archive detail course participation from 2021 to 2023. The data includes enrollment numbers, attendance, course types, and delivery platforms. It is published on the Colombian government's open data portal, datos.gov.co.
A geospatial dataset detailing urban planning zones and land-use prescriptions for the commune of Le Mériot, France. The data is provided by the Bureau de Recherches Géologiques et Minières (BRGM) via a Web Map Service (WMS). The documents posted online are described as informative only, with legally enforceable versions available for consultation at local government offices.
Urban planning data for the commune of Fontaine-Mâcon, France, likely detailing land-use prescriptions and zoning regulations. The dataset originates from the Bureau de Recherches Géologiques et Minières (BRGM) and is served via a Web Map Service (WMS). The authoritative versions of the documents are available for consultation at local town halls and competent public establishment offices.
A Web Map Service (WMS) provides access to local urban planning documents and zoning information for the commune of Avant-lès-Ramerupt, France. The data originates from the Bureau de Recherches Géologiques et Minières (BRGM) and is hosted on the EU Open Data platform. The online documents are described as informative only, with legally enforceable versions available for consultation at local town halls, EPCI head offices, or DDT offices.
KletterMix is a large German-language text dataset released as sharded JSONL files. The full release combines deduplicated data with remaining scored examples, superseding a smaller review-time subset. It was created by AIML-TUDA and was last updated on June 4, 2026.
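A sharded JSONL release like this one can be read one record at a time without loading every shard into memory. The following is a minimal sketch using only the Python standard library; the shard naming pattern and the `text` field in the usage note are assumptions for illustration, not details confirmed by the dataset card.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_shards(directory: str, pattern: str = "*.jsonl") -> Iterator[dict]:
    """Yield one parsed record per non-empty line across all matching shards."""
    # Sorting keeps the shard order deterministic between runs.
    for shard in sorted(Path(directory).glob(pattern)):
        with shard.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines between records
                    yield json.loads(line)
```

Calling `iter_jsonl_shards("path/to/shards")` returns a generator, so a loop such as `for record in iter_jsonl_shards(...): print(record.get("text"))` streams through the release shard by shard (the `text` key is hypothetical).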
A multi-task dataset of small molecules filtered from ChEMBL Release 36 for predicting ADMET properties. It contains 55,552 training rows and 14,928 test rows, split into structurally distinct clusters to create a hard scaffold-diverse test set. The dataset was created by Aarush Garg for the MolTuner framework and was last updated in May 2026.
Geoscience Australia contributes to managing 58 Commonwealth marine parks covering 40% of Australia's EEZ by providing new marine data. This release contains a 30-meter bathymetry grid for the Joseph Bonaparte Gulf area, processed using a semi-hierarchical seafloor classification scheme that categorizes seabed surfaces by slope into Plains, Slopes, and Escarpments. The data supports the development of 'eco-narrative' documents for marine park management.
Geomorphic features of the seabed within Australia's marine jurisdiction, including its Exclusive Economic Zone and offshore territories. The dataset was produced by Geoscience Australia and aggregated by the Australian Ocean Data Network. Features were mapped using the best available bathymetric data at a scale of 1:5,000,000.
Geoscience Australia's GA310 South West Margin 2D MSS survey acquired gravity line data in 2008/2009 as part of the Offshore Energy Security Program. Gravity data measure changes in rock density beneath the Earth's surface; the data were processed using standard methods and quality-checked by GA geophysicists. The survey acquired 26,000 line-kilometres of gravity and magnetic data.
From March 24 to April 5, 2022, the Minderoo-UWA Deep-Sea Research Centre collected bathymetric data in the South-west Corner and Perth Canyon Marine Parks using a Kongsberg EM304 multibeam sonar aboard the MV Pangaea Ocean Explorer. The processed data comprises 32-bit floating point GeoTIFF files at 64 m and 128 m resolution, created with QPS Qimera software. This dataset is published with permission from Geoscience Australia and is not intended for navigational use.
This dataset tracks vaccine doses administered and coverage rates relative to target populations across municipalities in Colombia's Atlántico department from 2018 to October 2024. It includes counts for 11 specific vaccines such as HEPATITIS A, VARICELA, and PENTAVALENTE, alongside municipal population figures. The data is hosted by www.datos.gov.co on the Socrata platform.
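Coverage relative to a target population is conventionally the number of doses administered divided by the target population, expressed as a percentage. A minimal sketch of that arithmetic follows; the function name and the sample figures are hypothetical and not taken from the dataset, which may define its coverage column differently.

```python
def coverage_rate(doses_administered: int, target_population: int) -> float:
    """Return coverage as a percentage of the target population."""
    if target_population <= 0:
        raise ValueError("target_population must be positive")
    return 100.0 * doses_administered / target_population


# Hypothetical figures for one municipality and one vaccine.
print(round(coverage_rate(450, 500), 1))  # 90.0
```

Values above 100 are possible under this definition when more doses are recorded than the estimated target population.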
Data on waste tonnage generated in the municipality of Roldanillo, Valle del Cauca, Colombia, sourced from www.datos.gov.co. The dataset includes columns for time periods and categorizations of waste, such as non-recoverable and recoverable materials. It was last updated on 2026-05-18.