Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,732 datasets
PhysicalAI WorldModel Synthetic Embodied Robot Scenes is a large-scale synthetic robotics video corpus. It was generated from USD-based robotic simulation and rendering pipelines built around NVIDIA Isaac Sim, Omniverse, and Isaac Lab. The dataset was last updated on 2026-05-31.
Legacy product from the Australian Ocean Data Network with no abstract available. The dataset describes heavy-mineral deposits along the coasts of three Australian states. It was last updated on 2026-06-05.
Alejandra Armesto created a multilevel dataset for analyzing particularistic spending by subnational governments. The dataset contains 2,552 municipal-level observations for Mexico and 644 departmental-level observations for Argentina, covering the period from 1993 to 2005. It was last updated on the Harvard Dataverse platform in May 2026.
42.0 KB of paired measurement data for validating an automated phantom-less quantitative CT system called "Bone's FRAX" in osteoporosis screening. The dataset records volumetric bone mineral density for the T11, T12, and L1 vertebrae of participants scanned on five CT scanner models, comparing results from the new method with those from the gold-standard phantom-based approach. It was authored by Yizhang Tong and last updated on 2026-05-12.
A collection of prompts and synthetically generated responses designed to align large language models with safety and security values. The dataset was created by NVIDIA and was last updated on June 4, 2026. It likely contains a hybrid of open-source and synthetic prompts targeting various model vulnerabilities.
53,713 separate polygons classify the global ocean floor into 11 distinct 'seascapes' based on six biophysical variables. This dataset, hosted by the Australian Ocean Data Network, was created using a multivariate statistical method to identify candidate sites for high seas marine protected areas. The analysis uses GIS tools to map seascape and geomorphic heterogeneity, aiming to provide an unbiased basis for conservation planning.
MYD09GA Version 6.1 provides daily, atmospherically corrected surface spectral reflectance for Aqua MODIS Bands 1-7. It serves as a foundational source for many downstream MODIS land products, offering data at 500-meter resolution alongside ten 1-kilometer observation bands and geolocation flags. The dataset includes documented known issues, such as non-functional detectors in Band 6, and incorporates calibration improvements like polarization corrections for Reflective Solar Bands.
Version 6.1 data provides daily global surface spectral reflectance estimates from the Aqua MODIS satellite at 250-meter resolution for bands 1 and 2, corrected for atmospheric gases, aerosols, and Rayleigh scattering. The product includes a Quality Assurance layer and five observation layers and is intended for use with the 500-meter MYD09GA product for quality and viewing geometry information. Known issues include non-functional detectors in Band 6, requiring users to consult detector flags and the MODIS Characterization Support Team website.
A curated collection of 4.79 million Wikipedia articles from the 2008 and 2010 snapshot releases. The dataset is cleaned and compressed for efficient large-scale language model pretraining. It was created by the author 'adhyanshaa' and last updated on the platform in May 2026.
3.1-billion-year-old Mesoarchean metapyroxenite samples from the Coorg Massif, India, provide constraints on mantle dynamics and crust-mantle interactions. The dataset includes integrated petrography, whole-rock geochemistry, mineral chemistry, and zircon U-Pb geochronology. It was authored by V. Deepchand and published on figshare in April 2026.
A 2026 report from the Government of Yukon reviews industrial minerals and minor metals alphabetically. It includes information on mineral types, uses, deposit characteristics, producers, market specifications, and prices, with specific focus on Canadian deposits and Yukon occurrences. The report comments on the likelihood of discovery and development potential, grouping minerals by their current status and future prospects.
A geological map describes the Mount Nansen and Stoddart Creek areas in the Dawson Range of Central Yukon. The description details basement rocks, intrusive suites, volcanic formations, and four main types of mineral deposits. It was published by the Government of Yukon and last updated on April 17, 2026.
A high-quality supervised fine-tuning dataset for penetration testing expertise and red team tradecraft. The dataset is structured to teach models how to think like offensive security practitioners, not merely recall labels or technique names. It was authored by me-aas and last updated on 2026-06-03.
A reinforcement-learning gym environment dataset of single-step ARC-AGI puzzle prompts for post-training large language models. Each row contains a text prompt rendering of an ARC puzzle with input-output grid pairs and a test input, with binary reward determined by exact-match comparison. The dataset was created by NVIDIA and last updated on 2026-06-04.
Gold occurrences in the upper Hyland River valley form a 50-km-long belt considered the easternmost portion of the Tombstone Gold Belt. Mineralization consists of four types: disseminated pyrite and arsenopyrite in altered grit, quartz-arsenopyrite veins, quartz-pyrite-galena veins, and massive arsenopyrite veins. The dataset is provided by the Government of Yukon and was last updated in April 2026.
Colombian government data on administrative staff personnel. The dataset includes records categorized by GÉNERO (gender), FORMACIÓN_ACADÉMICA (academic background), CARGO (position), CENTRO DE COSTOS (cost center), and PERIODO (period). It is hosted on the www.datos.gov.co platform via Socrata and was last updated on 2026-05-18.
A large-scale Pashto language corpus containing 11,272,055 text items for NLP research. It comprises 2,021,382 pre-training documents and 9,250,673 instruction pairs for supervised fine-tuning. The dataset was created by codewithnawaB and last updated on Hugging Face in May 2026.
A dataset of physics-validated 4D human-object interaction trajectories for the Unitree G1 humanoid robot, generated by the GRAIL pipeline. It was created by NVIDIA and last updated on Hugging Face in June 2026. The data includes scenarios such as tabletop pickup, ground manipulation, and navigating stairs and slopes.
Porewater and sediment data from a deep subsurface study of a subterranean estuary on Spiekeroog Island, Germany. The dataset includes samples collected to a depth of 24 meters below ground surface along a cross-shore transect, and results from a flow-through reactor experiment. It was authored by Magali Roberts and last updated on 2026-04-21.
Data from a study of rare earth element dynamics in the deep subsurface of a high-energy beach on Spiekeroog Island, Germany. The dataset is based on porewater and sediment samples collected to a depth of 24 meters below ground surface along a cross-shore transect, and includes results from a flow-through reactor experiment. It was authored by Magali Roberts and last updated on 2026-04-21.