Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,462 datasets
A 2013 catalog of 5,106 infrared bubbles in the Milky Way, created by NASA HEASARC based on citizen scientist classifications from The Milky Way Project. The catalog provides consensus parameters for bubble positions, radii, thicknesses, eccentricities, and position angles, with each object measured by at least five individuals. This first data release rediscovers 86% of the objects in three prior catalogs and identifies 29% of bubbles as nested or rim-associated.
ESQUEMA DE PUBLICACIÓN DE INFORMACIÓN, PERSONERÍA DE ENVIGADO is a structured catalog from the Colombian open data platform www.datos.gov.co. It describes information published and to be published by the obligated entity, in accordance with proactive disclosure principles under Law 1712 of 2014. The dataset was last updated on 2026-05-18 and includes 15 columns detailing format, responsible area, description, and access methods.
A dataset from a 2026 figshare study by Wenting He investigating the impact of generative AI on human creative agency. The data likely contains results from a between-subjects experiment with 162 participants who completed a music co-creation task. It examines how AI automation level affects subjective task load, psychological ownership, and state sense of agency, moderated by musical expertise.
An inventory of public information generated, obtained, acquired, or controlled by the Institute for the Development of Antioquia (IDEA) that has been classified as confidential or reserved. The dataset includes 16 columns detailing the legal basis, responsible parties, formats, and classification terms for each record. It was last updated on 2026-05-18 and is hosted by the Colombian open data portal www.datos.gov.co.
7,000 Chinese text pairs for modern Chinese to Lu Xun style rewriting. The modern Chinese source side was generated by DeepSeek V4 Flash through an API-based modernization pipeline, while the target side contains Lu Xun style Chinese text. The dataset was created by liuyanliang and last updated on Hugging Face in June 2026.
200 social media posts represent the top 20 most reposted items each month over a 10-month period. The data is annotated with 5 generic and 5 issue-specific frames, such as Conflict and Migration Flows, across four political groups. Author Tomasz Piróg released this dataset under a CC-BY-4.0 license on figshare.
An inventory of public information generated or controlled by the Sogamoso Chamber of Commerce that has been classified as confidential or reserved. The dataset includes 13 columns detailing the content, legal basis, responsible parties, and classification terms for each record. It is published by the Colombian open data portal, www.datos.gov.co, and was last updated in May 2026.
193,938 long-form reasoning traces and solutions for research-level mathematical problems, released alongside ResearchMath-14k. The dataset contains model-generated solution attempts, each with a problem statement, a chain-of-thought reasoning trace, and a final response. It was authored by 'amphora' and last updated on Hugging Face in June 2026.
The Brera Multi-scale Wavelet Chandra Source Catalog (BMW-Chandra) contains 21,325 X-ray sources identified from 136 Chandra ACIS-I observations public as of March 2003. The NASA HEASARC created this table in September 2008 based on the CDS catalog J/A+A/488/1221, making it the largest compilation of Chandra sources at its publication date. It includes source positions, count rates in multiple energy bands, flux estimates, and cross-matches with other astronomical catalogs.
All triggers observed by the 14 detectors of the Fermi Gamma-ray Burst Monitor (GBM), including 12 sodium iodide and 2 bismuth germanate detectors. The catalog is automatically updated within about a day of data processing by NASA's HEASARC, with latency requirements of 1 day for triggers and 3 days for bursts. Data originates from the Fermi GBM Instrument Operations Center and Fermi Science Support Center, provided as FITS files.
REGISTRO DE ACTIVOS DE INFORMACIÓN CÁMARA DE COMERCIO DEL CAUCA is a public information asset inventory from the Cauca Chamber of Commerce in Colombia. The dataset likely contains metadata about public information generated or controlled by the Chamber, including its format, language, and category. It was last updated on 2026-05-18.
A Geoscience Australia dataset examining the effects of eight spatial reference systems on the predictive accuracy of spatial interpolation methods for seabed sediments. The study applied inverse distance squared and ordinary kriging to marine data within the Australian Exclusive Economic Zone, assessing accuracy via cross-validation and map visualization. Results indicate negligible differences in predictive accuracy between the tested geographic coordinate systems and map projections.
Records of hazardous waste shipments within Connecticut, taken from paper manifests that were received at a rate of over 100,000 per year. The dataset includes generator, transporter, and treatment facility information, compiled by the Connecticut Department of Energy and Environmental Protection. Records span from 1984 to 2008.
Red Peatonal Pereira is a dataset describing the pedestrian network of the city of Pereira, Colombia, sourced from www.datos.gov.co. The network is intended to connect the urban territory, making communication nodes, facilities, and public spaces accessible to citizens traveling on foot. The dataset was last updated on 2026-05-18 18:28:19.
Net changes in the distribution areas of phyllostomid genera in the Neotropics are reported under different climate change scenarios for 2040. The dataset was authored by Daryl Cruz and published on figshare under a CC-BY-4.0 license. It was last updated on May 22, 2026.
June 2026 submissions from 8 frontier coding models, including Claude Opus 4.8 and GPT-5.5, autonomously writing CUDA/Triton GPU kernels. Each model had one unlimited-time run per problem to write the fastest kernel for an NVIDIA RTX PRO 6000 Blackwell GPU, graded as peak_fraction of the hardware roofline. The dataset was created by Infatoshi and hosted on Hugging Face.
A validation dataset comparing smartwatch-measured and self-reported sleep parameters from 130 participants over 841 sleep instances. The data was collected between November 2023 and June 2024 from participants wearing three generations of Garmin smartwatches. It was authored by Christina T. Saliba and shared under a CC-BY-4.0 license.
Yuanxin Cheng's dataset contains results from a study on trehalose enhancing postharvest Shine Muscat fruit resistance to gray mold (Botrytis cinerea). The data includes 2201 differentially expressed genes and 383 differentially expressed metabolites identified through comparative omics analyses. The dataset was last updated on 2026-05-01 and is shared under a CC-BY-4.0 license on figshare.
A 30-metre resolution bathymetric grid and derived morphological surfaces for Kimberley Marine Park in Australia's Commonwealth waters. The data was processed by Geoscience Australia using a two-part seafloor classification scheme that categorizes slope into Plains, Slopes, and Escarpments. This release supports the management of Australia's network of 58 marine parks covering 3.3 million square kilometres.
Pre-registration data for the 2025 Police Patrol Officer recruitment call in Colombia, sourced from datos.gov.co. The dataset includes applicant demographics such as marital status, gender, academic level, and geographic location. It was last updated on 2026-05-18.