Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,558 datasets
Fen Lou provides GROMACS input files for a 100-nanosecond molecular dynamics simulation of the Harpagide ligand binding to the MMP9 protein. The 170.2 MB collection includes topology, parameter, coordinate, and analysis script files sufficient to reproduce the full simulation. The dataset was published on the figshare platform in April 2026.
A 545-meter seismic refraction survey was conducted by the Bureau of Mineral Resources in 1959 for the Irrigation and Water Supply Commission. The survey aimed to locate subsurface river channels and aquifers in the Albert River alluvial flats near Beaudesert, Queensland. The record describes the survey and its results, including the identification of the water table and specific positions interpreted as potential channel locations.
Situs and mailing addresses for Cook County parcels, used by the Assessor's office to mail assessment notices. The dataset includes both owner and taxpayer information and was updated in March 2026. Data is updated monthly and maintained by the Cook County Assessor's office.
Per-question data collected using the google/gemma-3-12b-it model on the reward_bench dataset. The dataset contains structured outputs including reward scores and model completions generated across a range of temperatures. It was authored by 'wtd' and last updated on 2026-05-23.
An inventory of public information assets generated, obtained, acquired, transformed, or controlled by Colombia's National Planning Department (DNP). The dataset includes metadata such as content description, format, responsible entity, and publication status. It was last updated on 2026-05-18 and is available via the datos.gov.co platform.
This Geoscience Australia record describes a probable salt dome in the Woolnough Hills area of the Canning Basin. It details a dome of Cretaceous and probable Permian sediments, about 2 miles across, with concentric cuestas and a central mound of brecciated dolomite. The dataset was last updated on 2026-05-10.
Bo Zhou's dataset on figshare, last updated April 28, 2026, describes a novel antimalarial chemotype. The data likely contains results from in vivo mouse studies and in vitro resistance screens for compound 10b, an orally efficacious imidazo[4,5-c]pyridine-6-carboxamide. The dataset is 4.4 KB in size and is shared under a CC-BY-NC-4.0 license.
Projected coverage data for the School Feeding Program (PAE) in the Norte de Santander department of Colombia for the year 2022. The projection is based on student enrollment figures as of February 15, 2022, and includes details on educational institutions, enrolled students, and assigned meal quotas. The dataset is hosted on the Colombian open data platform www.datos.gov.co and was last updated in May 2026.
NASA's NAMMA LARGE dataset contains in situ aerosol measurements from a 2006 field campaign based in the Cape Verde Islands. Data from condensation nuclei counters, optical particle spectrometers, an aerodynamic particle sizer, and integrating nephelometers quantify aerosol number density, size distribution, and scattering coefficients. The campaign aimed to characterize African Easterly Waves and Mesoscale Convective Systems and their impacts on regional water and energy budgets.
abdelhaqueidali's Classical Poetry Dataset contains over 400 scraped titles, YouTube video links, and descriptions featuring lyrics for classical Southern Amazigh (Tashelhit) music. The collection serves as a text corpus for regional linguistic study, poetic analysis, and NLP tasks. The dataset was last updated on 2026-06-11.
BhashaBench-Multi is a large-scale multilingual multiple-choice question answering benchmark covering four specialised Indian knowledge domains across 22 Indian languages plus English. It was created by bharatgenai and last updated on 2026-06-09. The dataset is designed to evaluate language understanding and domain knowledge.
Eight bi-triazine cross-linkers with varied bond length, angle, aromaticity, and symmetry were used to modulate peptide conformation and biological activity. The dataset likely contains results from binding assays, cell studies, and in vivo PET/CT imaging for cyclic RGD and dimeric KTLLPTP peptide models. It was authored by Quan Zuo and uploaded to figshare on 2026-04-15.
On-street electric vehicle charging bays are part of a trial managed by Ausgrid and partners EVX and Plus ES. The trial aims to understand charging demand and support EV owners without home charging access. The dataset is published by the City of Sydney and was last updated in June 2026.
Over 1,250,000 square kilometres of central and eastern Australia are covered by this hydrogeological inventory of the Eromanga Basin. The dataset, provided by Geoscience Australia Data, groups descriptive attributes into themes like geology, groundwater management, and land use. It details the basin's Mesozoic sedimentary rocks and their complex depositional history linked to the breakup of Gondwana.
An instruction-tuning dataset used for supervised fine-tuning of a 100M English language model. The dataset, created by Aeryx-ai, combines the shared ChatML instruct dataset, SmolTalk core, and Dolly-15k. It was used in an experiment comparing two ~100M-parameter models with identical architecture and SFT but different pretraining token budgets.
A sample of 10 records from a premium dataset of AI-parsed analytics from SEC 8-K filings. The data includes sentiment, key metrics, and forward-looking statements extracted from filings of top S&P 500 companies, including Exhibit 99.1 press releases. The dataset was created by deniks315 and was last updated on 2026-06-03.
Mesa, Arizona's electric utility data for computing the System Average Interruption Duration Index (SAIDI). The dataset includes Year, Month, and Average Duration of Interruptions Per Customer, with reporting potentially lagging by 14 days. It is hosted by citydata.mesaaz.gov and was last updated on May 7, 2026.
The Northwest Australian continental shelf, spanning 1200 km from Barrow Island to Scott Reef, was surveyed during two 3-month cruises in 1967 and 1968. The Bulletin presents geological reconnaissance results from the Bureau of Mineral Resources, incorporating seismic profiles, echograms, and sediment notations from Admiralty Charts. The description of offshore structure and Phanerozoic sedimentation is based on petroleum exploration work up to 1971.
WorldTasks is a dataset for training world models via self-distillation, created by sebastian-stapf and hosted on Hugging Face. It pairs visual scenes with compact task instructions and detailed solution descriptions. The dataset was last updated on June 11, 2026.
Land classification for the Mineral Resources (Sustainable Development) Act 1990, developed by the Department of Energy, Environment and Climate Action. This geospatial dataset amalgamates features from multiple public land management sources to identify areas where special permission is required or where mining and exploration are unavailable. The data was last updated on April 9, 2026.