Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
43,044 datasets
An article extending a p-value-based multiple testing procedure for scenarios where study success requires at least k out of m hypotheses to be rejected. The extension replaces an initial gatekeeping step with a Fixed-Sequence MTP, allowing inferences even if the gate is not passed. The work includes an R function for calculating adjusted p-values and is licensed under CC-BY-4.0.
Nemotron-RL-SysBench-v1 is a text dataset for training and evaluating reinforcement learning agents on instruction and system-message following. The dataset was created by NVIDIA using a hybrid method of manual collection and synthetic generation. It is associated with the Nemotron Ultra model and was last updated on June 4, 2026.
NASA's New Horizons spacecraft collected this calibrated radio science data between 08/14/2018 and 01/31/2019 during the KEM1 mission phase targeting object MU69. The dataset includes calibration measurements using known radio sources, Jupiter, and cold sky, along with operational readiness tests and prime science observations. This is Version 1.0, containing only data downlinked before 02/01/2019.
The dataset describes the continental shelf off southeast Australia between Sugarloaf Point and Gabo Island. It details shelf morphology, sediment types, and geological features, likely compiled by the Australian Ocean Data Network. The dataset was last updated on 2026-05-05.
4500 tonnes of tungsten concentrates and 15 kg of gold were recorded from mineralisation in the Davenport province. The dataset describes the sedimentary, volcanic, and intrusive rocks of this Proterozoic geological province, including stratigraphic groups, formation ages, and geophysical characteristics. It is provided by the Australian Ocean Data Network via data.gov.au and was last updated in May 2026.
Nemotron-RL-CFBench-v1 is a dataset for reinforcement learning and text generation, focusing on instruction and constraint following. It is a hybrid dataset, manually collected and synthetically generated, and is associated with the Nemotron Ultra model. The dataset contains text in multiple languages, including English, Arabic, Hindi, Chinese, Japanese, and Korean.
Nemotron-RL-InverseIFEval-v1 is a text dataset for evaluating and training instruction-following models under adversarial conditions. Created by NVIDIA, it contains a hybrid collection of manually collected and synthetic data, with a capability breakdown focused entirely on counter-conventional instructions. The dataset was last updated on June 4, 2026.
Jervis Bay, New South Wales, is the location for marine surveys conducted by Geoscience Australia between 2007 and 2009. The dataset contains reference images of benthic infauna specimens, organized into four main phylum folders: Annelida, Crustacea, Echinodermata, and Mollusca. Data and samples were acquired from the MV Kimbla vessel, with a focus on a 3x5 km survey grid in the southern part of the bay.
A study conducted between September 2000 and January 2001 evaluated a silty clay cover installed to prevent Acid Rock Drainage at a former gold and silver mine near Carcross, Yukon. EBA Engineering Consultants Ltd., the Carcross Tagish First Nation, and partners, on behalf of the Mining Environment Research Group and Indian and Northern Affairs Canada, installed equipment and completed three rounds of testing. The data gathered includes measurements of thickness, oxygen concentrations, temperatures, and moisture levels within the tailings and the cover.
British Columbia's municipal tax data from 2003 to 2008, reported by local governments to the provincial ministry. The dataset contains statistics on taxes imposed and collected, compiled from annual financial reports following Generally Accepted Accounting Principles (GAAP) for local governments. Data for regional districts incorporates current-year property assessments and certified population estimates.
The Environmental Trends in British Columbia 2007 dataset contains geographically-based regional data reported in the 2007 publication. The spreadsheet is designed with column filters to select topics, environmental indicators, or geographical areas, helping users find data for specific communities. Data resolution is provided, and data were reported as either point data (site, city, town) or area data (watershed, regional district, ecosection, ecoprovince).
164 man-made reservoirs and regulated natural lakes globally are monitored for monthly water storage changes. The dataset provides a monthly time series derived from satellite classifications and models, including reservoir surface area, elevation, storage capacity, evaporation rate, and evaporation volume. Known issues include surface area estimation uncertainties in high-latitude regions and potential overestimation due to lake ice coverage.
552,960 sparse features were extracted from the SmolLM2-135M-Instruct model using 30 custom sparse autoencoders. The atlas covers every layer and component, including MLP, gate, up-projection, and attention heads, with a mean explained variance of 0.9519. It was created by juiceb0xc0de and last updated on June 16, 2026.
3.7 MB of data and analysis code supporting a 2026 study on dopamine's role in larval Drosophila motor control. The dataset, authored by Bella Xu Ying, includes files required to regenerate figures from the associated preprint. It contains results from dual-color calcium imaging, bath application experiments, and optogenetic manipulations during crawling and tunneling behaviors.
Catalogue of fungi in China 7 reports 18 new taxa of lichen-forming fungi discovered in China. The dataset includes one new order, one new family, three new genera, and 13 new species, with 10 collected from the Xizang Autonomous Region. It was authored by Qiu-Xia Yang and last updated on 2026-04-24.
Delta-X 2021 field efforts collected in situ above-water remote-sensing reflectance measurements in the Atchafalaya River and Terrebonne Basins of coastal Louisiana. Data capture spans two seasons, from March to April and August to September 2021, with collection paused and resumed around Hurricane Ida's landfall. The dataset, produced by ORNL_CLOUD, provides Version 3 processed reflectance values calculated from radiance measurements taken with a handheld Portable SpectroRadiometer.
The Gippsland Lakes coastal environment in Australia is covered by this dataset. It represents the inundation extent for a 10% Average Exceedance Probability water level, incorporating a 0.8-meter sea level rise condition based on hydrodynamic modeling. The data was produced by the Department of Energy, Environment and Climate Action and was last updated in April 2026.
A mixed-method study of 605 women who underwent C-Sections in public, private, or semi-private healthcare facilities in Pakistan. The dataset includes qualitative and quantitative evidence on the patterns and drivers of C-Section practices, authored by Maria Atif and last updated in May 2026. It aims to explore stakeholder perceptions and WHO-recommended interventions for reducing unnecessary procedures.
Official mortality data on suicide in the department of Caldas, Colombia, provided by the National Administrative Department of Statistics (DANE). The dataset contains annual suicide rates by municipality, with records from the year 2000 to the most recent available date. It allows for analysis by sex and geographic zone to support public health awareness, prevention, and policy formulation.
A registry of beneficiaries for the 'Renta Ciudadana' social welfare program in the municipality of Arboledas, Norte de Santander, Colombia. The dataset includes columns for beneficiary demographics, enrollment status, and location. It was published via the Socrata platform on datos.gov.co and was last updated on 2026-05-18.