Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,732 datasets
A multilingual pretraining corpus of 9,836,075 documents (~8.4B estimated tokens) across 10 Indic languages and English. It was built from the HPLT Monolingual v3 high-quality web crawl data and is hosted on Hugging Face by author ashtok897.
Purchasing expenditure data from the Dutch Ministry of General Affairs for the year 2017. The dataset is used to inform the House of Representatives about government procurement and the share of small and medium-sized enterprises. It originates from the Ministry of the Interior and Kingdom Relations and is published under a CC0-1.0 license.
709,321 real-world software vulnerabilities train VLAI, a transformer model for automated severity classification. The dataset is presented in the paper 'VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification'. It was created by CIRCL and last updated on Hugging Face in May 2026.
A structured diagnostic matrix for Simson mopeds, created by jmp1987 and last updated on June 8, 2026. It maps 13 symptoms to 44 potential causes with associated probabilities, diagnosis steps, and part references. The dataset is designed to support repair and troubleshooting workflows.
Australia, including its outer islands and external territories, is covered by this seamless topographic color mapping service. The data is sourced from Geoscience Australia, the Australian Antarctic Division, OpenStreetMap, and other government programs, and portrays cultural, hydrography, marine, transport, vegetation, and relief themes. The topographic information was checked in 2008 and supplemented in 2009, with contributions acknowledged from several Australian government departments.
Australia's National Base Map provides seamless topographic color mapping for the entire country, including outer islands and external territories. The service integrates data from Geoscience Australia, the Australian Antarctic Division, OpenStreetMap, and other sources, covering cultural, hydrography, marine, transport, vegetation, and relief themes. This specific version does not include any map labels.
Evidence is presented for mineralogical and chemical zoning in seven deposits of the Cobar-Nymagee area. The deposits are contained in distal turbidite facies of the Devonian Cobar Supergroup, deposited in a meridional trough. The dataset is provided by the Australian Ocean Data Network and was last updated on 2026-04-28.
The Australian Ocean Data Network hosts a paleontological dataset describing trilobite occurrences and stratigraphy in western New South Wales. It records nine trilobite species from three localities within the Boshy Formation of the Kayrunnera Group, dated to the early Late Cambrian (Mindyallan). The dataset was last updated on 2026-04-28.
A subset of BridgeData V2 packaged with short robot-manipulation video clips and synthetic video captions. It was created by NVIDIA and last updated on June 1, 2026. The dataset is intended for workflows involving text-to-video, image-to-video, and video-to-video generation of robot manipulation scenes.
Slag bank samples from Scotland and northwest England (Warton, Derwent Howe, and Harrington) were analyzed using Thermogravimetric Analysis (TGA) in 2022-2023. The raw data was collected by John MacDonald and Robin Hilderman of the University of Glasgow to identify volumes of carbon-bearing materials, specifically carbonate minerals. The dataset is associated with NERC Grant NE/X009718/1 and hosted by the British Geological Survey.
2022-2023 data from carbon and oxygen stable isotope analysis of calcite in samples from two field locations, Warton and Glengarnock. The dataset contains raw and processed measurements collected to determine whether carbonate minerals contain atmospheric carbon dioxide. Data was collected by John MacDonald and Charlotte Slaymark of the University of Glasgow and is held by the British Geological Survey.
PREFIRE's dual CubeSats carry spectrometers measuring previously unobserved far-infrared radiation from Earth's polar regions to fill knowledge gaps in the global energy budget. This dataset provides cloud-optimized GeoTIFF renderings of retrieved column water vapor values for latitudes between approximately 60° and 84° in both hemispheres. Science data retrieval for this NASA project started on July 24, 2024 and is ongoing.
The Saildrone Arctic 2019 dataset from the NASA NOPP_MISST Project provides high-quality, near real-time surface ocean and atmospheric observations. Six wind- and solar-powered Saildrone uncrewed surface vehicles collected data during a 150-day cruise in the Bering and Chukchi Seas from May to October 2019. The mission aimed to improve modeling of diurnal warming and sea-surface temperature algorithms, measuring parameters such as air temperature, wind, seawater temperature, salinity, chlorophyll fluorescence, and currents.
This United Kingdom dataset identifies potential areas for seagrass restoration based on physical environmental criteria. The Environment Agency created it by combining wave energy, current energy, elevation, salinity, and turbidity models. A version 3 update was published in April 2026.
50 million (query, document) pairs uniformly sampled from the 'lightonai/embeddings-pre-training-curated' corpus. The dataset was created by author 'capemox' and was last updated on the Hugging Face platform on 2026-05-29. Pairs were sampled proportionally from 34 source subsets using a uniform Bernoulli sampling strategy with seed 42.
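As a rough illustration of what uniform Bernoulli sampling with a fixed seed means, the sketch below keeps each pair independently with the same probability, so every subset contributes in proportion to its size on average. It is not the dataset author's code: the function name `bernoulli_sample`, the `rate` parameter, the in-memory dictionary layout, and the use of Python's `random` module are all assumptions made for this example.

```python
import random
from typing import Dict, Iterator, List, Tuple

Pair = Tuple[str, str]  # (query, document)


def bernoulli_sample(
    subsets: Dict[str, List[Pair]], rate: float, seed: int = 42
) -> Iterator[Tuple[str, Pair]]:
    """Yield (subset_name, pair) for each pair kept by a Bernoulli trial.

    Hypothetical helper: every pair is kept independently with
    probability `rate`, regardless of which subset it comes from.
    """
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    for name in sorted(subsets):  # stable subset order, so runs match
        for pair in subsets[name]:
            if rng.random() < rate:
                yield name, pair


# Toy data standing in for the real source subsets.
toy_subsets = {
    "subset_a": [(f"query {i}", f"document {i}") for i in range(1000)],
    "subset_b": [(f"query {i}", f"document {i}") for i in range(100)],
}
sample = list(bernoulli_sample(toy_subsets, rate=0.1))
```

Because the same rate applies everywhere, no per-subset quotas are needed; the sample size is random rather than exact, which is the usual trade-off of Bernoulli sampling.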
A multilingual supervised fine-tuning dataset for text-generation models. It was generated by NVIDIA by translating seed data from three other Nemotron datasets to add coverage for Hindi, Korean, Brazilian Portuguese, and Japanese. The dataset was last updated on June 4, 2026.
The Comptroller's Office of Antioquia, Colombia, lists its information assets, detailing their content, format, and storage medium. The dataset includes columns for document titles, descriptions, languages, and multiple format and conservation media fields. It was last updated on 2026-05-18 via the datos.gov.co platform.
The Australian Ocean Data Network provides a geological report on the Palaeoproterozoic Tennant Creek and Granites-Tanami Inliers in the Northern Territory. The description details stratigraphy, tectonic evolution, and gold mineralization within these regions. The dataset was last updated on 2026-04-28.
229.6 MB of raw data supporting research on synergistic pest control. The dataset includes TIF and PNG image files alongside XLSX spreadsheets, published by Lu Yu under a CC0-1.0 license in May 2026. It documents experiments combining a fungal biocontrol agent with a chemical insecticide to suppress western flower thrips.
Raw LC-MS data files associated with a specific swab set from the COVIDCAP Protocol paper. The 21.4 MB dataset was authored by Ellen Liggett and last updated on 2026-05-08. Data is provided in MZML format under a CC-BY-4.0 license.