Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,808 datasets
Replication data for the paper 'Guarding economic interests abroad: FDI, political instability, and the proliferation of Chinese police training.' The package includes a dataset in Stata format, a Stata do-file for statistical analyses, and supplementary materials with figures and tables. The data was authored by Sae-Phoo, Lin and is hosted by Harvard Dataverse.
A collection of code and data for reproducing results from the paper 'Molecular Simulations Assisted by an Artificial Intelligence Agent (ArIA)'. The dataset includes directories for model development, prompt generation, and application deployment. It was authored by Supphachok Chanmungkalakul and last updated on 2026-05-18.
Geoscience Australia Data provides a 2026 report detailing a plane table and theodolite survey of the abandoned Coronet Hills copper mine in the Northern Territory. The document describes sulphide-bearing lodes mineralized with copper, lead, and arsenic, and includes assay results from dumps and underground workings. It concludes with proposed locations for six diamond drill holes to test extensions of the lodes.
128.9 MB of simulation data from a study comparing the Dual Twist Channel Angular Extrusion (DTCAE) process to Equal Channel Angular Pressing (ECAP). The dataset includes outputs from 3D Finite Element Method simulations run in DEFORM-3D, analyzing plastic deformation and strain distribution. It was authored by Vikash Ranjan and uploaded in April 2026.
A 2026 case study from Lwamondo village, South Africa, investigates antimicrobial resistance in E. coli using a One Health approach. The research analyzes 47 paired stool and soil samples, yielding 117 and 94 E. coli isolates respectively, with phenotypic and genotypic resistance testing. Authored by Solanka Ellen Ledwaba, the dataset is a published PDF report.
Colombian data tracks the use of energy tariff subsidies in non-interconnected zones (ZNI). The dataset includes company-level details on fuel purchases, quarterly spending, and subsidy amounts allocated to different socioeconomic strata. It is published by datos.gov.co and was last updated on 2026-05-18.
A 1961 geological mapping program by the Bureau of Mineral Resources' Great Artesian Basin Party produced this dataset. It covers the Julia Creek area, forming the western and northern margins of the Eromanga Sub-Basin in Western Queensland. The data describes Cretaceous rocks overlying a crystalline basement, with small outcrops of Precambrian granite and metamorphics in the southwest, and areas masked by Cainozoic and recent deposits.
Over 1,495 parks and public spaces across Montreal's boroughs, covering more than 6,412 hectares. The dataset provides surface polygon representations for these areas within the urban fabric. Data is for representational purposes and is not a legal reference for park boundaries.
Ground-Based Doppler Orbitography by Radiopositioning Integrated on Satellite (DORIS) IDS Station Coordinates Product from NASA CDDIS provides station position time series in STCD format. The dataset is derived from DORIS data analysis by International DORIS Service (IDS) centers and is hosted by the National Aeronautics and Space Administration. One platform indicates a last update date of March 13, 2026.
The IfGPT Dataset is developed within the project IfGPT: Infrastructure for Fine-tuning Pre-trained Large Language Models. It aims to establish a freely accessible infrastructure for the selection and pre-processing of large datasets for Bulgarian as well as tailored data for particular industries. The dataset is authored by DCL-IBL and was last updated on Hugging Face in June 2026.
IfGPT is a dataset developed to establish a freely accessible infrastructure for fine-tuning large language models for Bulgarian. The project aims to provide tailored data for specific industries and purposes. It was created by DCL-IBL and was last updated on June 3, 2026.
A parallel dataset for Rakhine and Standard Burmese (Myanmar) language processing. The dataset was created by the author 'rakhine-nlp' and was last updated on the platform in June 2026. It is intended for machine translation, language modeling, and dialect analysis.
A basic collection of Karachay words and phrases intended for training and fine-tuning language models for the Turkic language group, specifically the Karachay-Balkar language. The dataset is hosted on Hugging Face by author 'thetemirbolatov' and was last updated on 2026-05-27. Its size category suggests it likely contains between 10,000 and 100,000 entries.
7 years of follow-up data from the National Health and Aging Trends Study (NHATS) covering 480 community-dwelling older adults. The dataset, created by Jianhui Pan, links objective wrist-worn accelerometry metrics to long-term trajectories of functional disability.
480 community-dwelling older adults from the National Health and Aging Trends Study were monitored for 7 years using wrist-worn accelerometers to link objective activity patterns with functional disability trajectories. The dataset, created by Jianhui Pan and published in 2026, includes weighted data representing a population of 1.9 million.
20,085 true/false and 18,262 multiple-choice questions automatically generated from daily news headlines. The dataset, created by agentic-learning-ai-lab, spans from January 1, 2020, to May 26, 2026, and is designed to evaluate how large language models' prescient capabilities evolve over time.
Datos.gov.co hosts public georeferenced data on users connected to various Wi-Fi zones in the Municipality of Tunja, Boyacá, reported by Primary Data Generating Units (UPGD). The dataset includes columns for ZONA, SECTOR, ID, FECHA Y HORA, ANIO, FECHA, LATITUD, HORA, and LONGITUD. It was last updated on 2026-05-18.
South-east South Australia's stranded coastal barriers preserve a record of sea-level variations over the past 800,000 years. This dataset presents new single-aliquot regenerative-dose optically stimulated luminescence (SAR-OSL) ages for quartz extracts from these dunes, extending the tested age range to 0-250 ka. The data, sourced from Geoscience Australia, compares these ages with an existing independent chronology to validate the SAR-OSL dating method.
Voyager 2 Plasma Spectrometer (PLS) data from the July 1979 Jupiter flyby. The dataset contains high-energy-resolution ion current spectra for protons across 128 logarithmic energy channels from 10 eV to 5950 eV, measured in femtoamperes. It was produced by NASA, with instrument details described in a 1977 Space Science Reviews reference.
Voyager 1's Low Energy Charged Particle experiment data collected in the vicinity of Jupiter. The dataset includes 48.0-second rate and flux measurements for electrons and ions across almost 100 instrument channels, with particles including protons, alpha particles, and light to heavy nuclei. NASA produced this globally calibrated dataset, last updated on the platform in April 2026.