Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,649 datasets
The Eromanga Basin dataset from the Australian Ocean Data Network contains descriptive attribute information for spatial groundwater features in the Great Artesian Basin. It covers over 1,250,000 square kilometres in central and eastern Australia, with data grouped into themes like location, geology, hydrogeology, and land use. The dataset was last updated on 2026-04-16.
This study analyzes an estimated 362,869 boxing injuries treated in U.S. emergency departments from 2000 to 2023, drawn from the National Electronic Injury Surveillance System. Authored by Jenna Tsuzaki and published on figshare in 2026, it reports injury rates, demographics, diagnoses, and mechanisms. It finds a 46.6% increase in the injury rate over the period, with fractures being the most common diagnosis.
Wide-angle seismic data from ocean bottom seismographs, together with gravity and deep marine reflection profiling data, define crustal-scale features along the Vulcan transect in northern Australia. The dataset, provided by Geoscience Australia, outlines the crustal and upper mantle architecture across the boundary between the Australian and SE Asian plates. It includes interpretations of crustal thickness, basin sequences, and evidence of intrusive rocks at depth.
The Albany Canyon complex extends 700 km from Cape Leeuwin to east of Esperance, with canyons cutting down 1500-2000 m in places. Data from Geoscience Australia includes information from seismic profiles and describes canyon morphology, orientation, and exposed Jurassic and younger sequences. This dataset was last updated on 2026-04-30.
A framework for modelling beach erosion from clustered storms, focusing on two case study areas in southeast Australia: the Adelaide metropolitan coast and Old Bar beach. The dataset integrates coastal geomorphology and engineering approaches, using sub-surface information like boreholes and ground-penetrating radar to estimate sediment volumes. This work is a contribution to the Bushfire and Natural Hazard Cooperative Research Centre project on storm surge resilience.
Specimens of the planktic foraminifera species Globigerina bulloides were collected from the Norwegian Sea in summer 2022 and their tests grown in laboratory culture. The dataset provides element/Ca ratios (Mg/Ca, Na/Ca, Sr/Ca) for these cultured specimens and their culturing substrate, measured by Laser Ablation ICP-MS at the University of Southampton in summer 2023. The data correspond to articles by Sykes et al. (2024 and in submission).
Two data points were generated by Cogniti AI. The dataset is a 441.9 KB file in PNG format, authored by Qiaoying Liang and last updated on May 28, 2026. It is shared under a CC-BY-4.0 license on figshare.
A catalog of flaring gamma-ray sources detected by the Fermi Large Area Telescope over 7.4 years, from August 2008 to January 2016. The Fermi All-sky Variability Analysis (FAVA) technique was used to search for flares in weekly time bins across two energy bands. This catalog was produced by NASA and ingested by the HEASARC in July 2017.
A synthetic dataset of 234 question-answer pairs designed to mirror the distribution of the MMMU-Pro benchmark. It contains 78 unique questions across 30 academic subjects, each presented in three different visual and textual formats. The dataset was created by YiYang109 and last updated on Hugging Face in May 2026.
Seasonal riverine discharge drives large intra-annual variations in temperature (13-29°C) and salinity (3-30) at two sites in the Swan River estuary. Anoxia in bottom waters associated with a salt wedge increased ammonium and phosphate concentrations, especially at the deeper site. The dataset, sourced from Geoscience Australia Data, examines major ions, nutrients, and chlorophyll a to assess nutrient limitations on phytoplankton growth.
British Geological Survey data from the WREED project includes analyses of archive rock and soil samples from rare earth element deposits in Mongolia and China. The data characterizes mineralogy, bulk rock geochemistry, and sequential leaching experiments on laterite, weathered rock, and soil overlying carbonatite-related REE deposits. It was collected to determine enrichment and depletion of REE relative to bedrock, the mineral host of REE, and the ease of extraction.
2022-2023 photographs from an optical microscope using transmitted and reflected light. The images visualize spatial textures and microstructures in samples from the Warton slag bank. The data was collected by John MacDonald and Robin Hilderman of the University of Glasgow and is held by the British Geological Survey.
2022-2023 raw X-Ray Diffraction (XRD) analysis data for samples collected from four slag bank field locations: Warton, Glengarnock, Derwent Howe, and Harrington in Scotland and northwest England. The data was collected by John MacDonald and Robin Hilderman of the University of Glasgow for the purpose of identifying sample mineralogy. The dataset is hosted by the British Geological Survey (BGS).
SGI-Bench is a scientist-aligned benchmark for evaluating Scientific General Intelligence in large language models. It spans 10 scientific disciplines and contains more than 1,000 expert-curated samples inspired by Science's 125 Big Questions. The dataset was created by InternScience and last updated on Hugging Face in June 2026.
A dataset snapshot of pre-tokenized sequences used in the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'. The data consists of packed GPT-2-tokenized sequences derived from the DCLM corpus, prepared for studying pretraining in data-constrained, compute-rich regimes. The snapshot was uploaded by author zhiwei555 to Hugging Face.
The XMM-Newton Serendipitous Source Catalog from Stacked Observations contains 427,524 unique X-ray sources, with 329,972 observed multiple times. Compiled by NASA from 10,336 overlapping XMM-Newton observations taken between 2000 and 2023, it provides source parameters like fluxes and hardness ratios derived from simultaneous fits. The catalog includes 1,807,316 individual flux measurements aimed at studying long-term variability of X-ray emitting sources.
A dataset created using the Easy Dataset tool for streamlining fine-tuning datasets for Large Language Models. It contains data related to the Jeppesen programming language and is formatted according to the alpaca structure. The dataset was last updated on June 12, 2026.
Geoscience Australia's 2004 Southwest Frontiers Survey acquired 2700 km of industry-standard seismic data to study the continental margin. One key finding is that basement velocities are in the 5.2-5.6 km/s range, indicating a composition that is likely not granitic. Results from a conjugate site in Antarctica, obtained by a Russian expedition in 2004-2005, show consistent low velocities, suggesting a ~400 km wide zone in the pre-breakup Gondwana supercontinent.
The XMM-Newton Serendipitous Source Catalog from Stacked Observations (4XMM-DR14s) contains 427,524 unique X-ray sources compiled from 10,336 overlapping observations taken between 2000 and 2023. It includes 1,807,316 individual flux measurements, with source parameters like fluxes, hardness ratios, and variability information derived from simultaneous fits. The catalog was produced by the National Aeronautics and Space Administration (NASA) and aims to study the long-term behavior of X-ray emitting sources.
ViroBlend is a 216 megabase pair mixed pre-training corpus introduced by YDXX. It combines broad genomic context with enriched viral signals using source-wise stratified sampling to balance three heterogeneous data sources. The dataset was last updated on 2026-05-29.