Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,668 datasets
Geoscience Australia's Semi-automated Morphological Mapping Tools (GA-SaMMT) are a suite of seven ArcGIS Pro Python toolboxes for seabed characterisation. The tools map ten bathymetric high and eight bathymetric low morphology features, plus three morphological surface classes, as defined in published research. The package includes tutorials, a user guide, and sample data, and it has been applied to multiple real-world study areas.
Sexual offense statistics for the period from January 1 to December 31, 2019, compiled by the Colombian Ministry of Defense's Directorate of Criminal Investigation and INTERPOL. The dataset contains 22 columns detailing the crime, victim demographics, and incident circumstances. It is hosted on the Colombian open data portal, datos.gov.co.
ToxSyn-PT is a large-scale synthetic dataset containing 53,274 sentences for hate speech detection in Brazilian Portuguese. The dataset is evenly balanced between toxic and non-toxic labels and covers nine legally protected minority groups, including Black, Women, LGBTQIA+, and Elderly. It was created by AKCIT and last updated on the platform in June 2026.
ESRI grids provide sea salinity data for the Australian region (100-200E, 50-0S) on a 0.1-degree grid. The data is derived from the CARS2000 seasonal climatology, which was generated using a Loess filter on historical oceanographic data from sources like the World Ocean Atlas 98 and CSIRO archives. It includes values at depths of 0, 150, 500, 1000, and 2000 metres, representing mean and seasonal cycles.
A dataset of student responses to structured examination questions for the COS101: Introduction to Computing course at FUHSO, Nigeria, collected for the 2025/2026 academic year. The data, created by TEMIDAYO OMOTEHINWA, was used to develop an AI-driven short-answer grading system that combines semantic similarity with rubric-based evaluation. It includes student answers paired with model answers and mark allocations.
Sites where earth resources like metallics, industrial minerals, and construction materials have been demonstrated, excluding oil, gas, and groundwater. The dataset originates from the VICMINE RDBMS, compiled largely from historical literature with selective field visits. It was last updated by the Department of Energy, Environment, and Climate Action in April 2026.
Heliocentric trajectory data for the BepiColombo mission and Comet Borrelly, provided in Heliographic (HG), Heliographic Inertial (HGI), and Solar Ecliptic (SE) coordinate systems. The data is produced by NASA using the 'Mean of Date' method for the Equinox Epoch and sourced from JPL Horizons. The dataset was last updated on March 13, 2026.
Heliocentric trajectory data for Comet Borrelly, calculated using the 'Mean of Date' method for the Equinox Epoch. The data is provided in Heliographic (HG), Heliographic Inertial (HGI), and Solar Ecliptic (SE) coordinate systems, sourced from NASA's JPL Horizons system. The dataset is maintained by the National Aeronautics and Space Administration and was last updated in March 2026.
Heliocentric trajectory data for Comet Giacobini, calculated using the 'Mean of Date' method for the Equinox Epoch. The original data is sourced from NASA JPL's Horizons system, which provides ephemerides for many solar system objects. This dataset includes daily positions in Heliographic, Heliographic Inertial, and Solar Ecliptic coordinate systems.
Heliocentric trajectory data for Comet Grigg-Skjellerup, calculated using the 'Mean of Date' method for the Equinox Epoch. The data is provided in Heliographic (HG), Heliographic Inertial (HGI), and Solar Ecliptic (SE) coordinate systems. This dataset is produced by the National Aeronautics and Space Administration and was last updated on March 13, 2026.
NASA provides daily heliocentric trajectory data for Comet Hale-Bopp in Heliographic (HG), Heliographic Inertial (HGI), and Solar Ecliptic (SE) coordinate systems. The data is derived from JPL Horizons ephemeris and calculated using the 'Mean of Date' method for the Equinox Epoch. The dataset was last updated in March 2026.
MODIS/Aqua Cloud Properties Level 3 monthly product provides gridded statistics on cloud characteristics to ensure continuity between MODIS and VIIRS instruments. The dataset includes scalar and histogram data calculated identically to the standard MODIS Level-3 products. It is produced by the LAADS organization and was last updated in March 2026.
A global Level-3 gridded product that provides continuity of cloud property statistics between the MODIS instruments on the Aqua and Terra platforms and VIIRS. It includes scalar and histogram statistics calculated identically to the standard MODIS products. It is produced and actively maintained by LAADS and was last updated on March 12, 2026.
Surface flux and micrometeorological measurements collected from May 28 to October 17, 1987, at a central, uniform vegetation site during the FIFE study's four Intensive Field Campaigns. The Bowen ratio system captured all major components of the surface energy budget. The dataset includes a large set of measured and derived parameters describing dynamical, thermodynamical, hydrological, and radiative properties of the ground surface and atmosphere.
February 1994 data from the BOREAS project cover portions of the SSA, NSA, and transect areas in the boreal forest. The dataset contains surface meteorological inputs for an energy balance model used to estimate snow water equivalent (SWE). These SWE estimates are compared with in-situ observations to assess the accuracy of airborne and spaceborne microwave retrieval algorithms.
Anonymized results from the generic component of the Saber TyT exam for Colombian technical and technological higher education programs. The dataset is hosted by the Colombian open data portal www.datos.gov.co and was last updated in May 2026. It contains individual student-level data, including test scores, socioeconomic indicators, and institutional information.
Level-0 AOCI imagery was collected on a single flight on 21-Jul-1994 to provide spatially extensive radiant energy information over the BOREAS study areas. The instrument's wavelength bands were specific to investigating aquatic parameters like chlorophyll content and turbidity. Companion files include an image inventory and example thumbnails for data access.
30,000 MSCOCO-2014 validation captions used for FID evaluation in the MiniT2I PyTorch/Diffusers release. The assets include a JSON file of captions and an NPZ file containing reference Inception statistics (mu and sigma arrays) for 512x512 images. The data was created by MiniT2I and last updated on June 14, 2026.
FineWeb-English-Filtered is a large-scale, cleaned, English-only text dataset derived from Common Crawl’s WET archives. It contains 940 million documents of publicly available web text, converted into Apache Parquet format with a consistent schema for fast and efficient data loading. The dataset was generated by anandjh8 using a custom AWS Glue pipeline and was last updated on 2026-06-04.
Salford City Council discloses remuneration details for senior staff earning over £50,000, as mandated by the Local Government Transparency Code 2014. The dataset includes salary brackets, job titles, responsibilities, and details on bonuses and benefits-in-kind. It is published by the Government Digital Service via the eu_open_data platform.