Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,732 datasets
Nemotron-SFT-ARC-AGI-v1 is a supervised fine-tuning dataset of multi-turn agentic reasoning traces. It was created by NVIDIA using nine open-weight large language models attempting to solve ARC-AGI visual-reasoning puzzles. The dataset was last updated on June 4, 2026.
Local Code Arena Telemetry captures raw evaluation metrics and execution logs from running the Mostly Basic Python Problems benchmark against the 1.7B-parameter Qwen3 model. The dataset was created by ShahzebKhoso and last updated on 2026-05-29. It provides a direct point of comparison for evaluating next-generation AI models on consumer hardware.
Eight sediment cores from Vincennes Bay and 19 from Prydz Bay were collected during the 1996/97 Antarctic season to study ice sheet retreat. About 200 km of seismic data from Vincennes Bay and 900 km from Prydz Bay reveal glacial erosion patterns and moraine structures. This post-cruise report summarizes preliminary results from the AGSO/ANARE marine geoscience program in East Antarctica.
Over one million variation tuples derived from variable Google Fonts, used for training the NIV (Neural Axis Variations) model. The dataset comprises per-point displacements for font outlines. It was created by ndvb and was last updated on the platform in June 2026.
A qualitative dataset from the SEENEZ GH trial, containing interview data from 26 participants. The data was collected by researchers to analyze preferences for continuing or discontinuing growth hormone treatment in adolescents with transient idiopathic isolated growth hormone deficiency. The dataset was uploaded by figshare admin karger and last updated on April 22, 2026.
Inner Darwin Harbour and shallow water areas in and around Bynoe Harbour were surveyed from 29 May to 16 August 2017. The project collected 285 seabed sediment samples for grain size, inorganic elemental, and organic matter analyses, alongside seagrass and hardground observations. This work was part of a four-year (2014-2018) science program led by the Northern Territory Government and funded by the INPEX-led Ichthys LNG Project, in collaboration with Geoscience Australia and the Australian Institute of Marine Science.
MedSP1000 is an interactive benchmark derived from standardized patient cases for evaluating large language models as clinical agents. The dataset, created by byrLLCC and described in a 2026 paper, focuses on dynamic, multi-turn clinical encounters rather than static medical question-answering.
Nemotron-SFT-Math-v4 is a large-scale mathematical reasoning dataset containing model-generated reasoning trajectories. Solutions were generated using DeepSeek-V4-Pro in High inference mode. The underlying problems are sourced from the nvidia/Nemotron-Math-v2 dataset, which contains high-quality mathematical problems derived from the Art of Problem Solving (AoPS) community and Math StackExchange/MathOverflow.
Six CSV files support the analysis of ethnic pictorial manuscripts from Yunnan-Guizhou. The data includes coding of agricultural tool morphology across five manuscript versions, symbol-ethnic group co-occurrence frequencies, and a policy-artifact time series from 1730 to 1790. Author Xin Wu published this dataset on figshare in 2026 under a CC-BY-4.0 license.
A 38.7 KB Excel database supporting a 2016 master's thesis and a 2019 book chapter on news framing. It was created by Edrei Álvarez-Monsiváis and last updated on 2026-05-15. The data likely contains coded content from news articles about celebrity Caitlyn Jenner.
This dataset contains 5,000 unique synthetic examples, created in May 2026, designed to teach step-by-step reasoning. It was programmatically generated by gss1147 to mirror the thinking style of Meta's Muse Spark frontier model. Each reasoning trace is structured around four steps: Understand, Plan, Execute, and Verify.
59.6 KB of data supporting an AI-assisted framework for inductive theory building in sustainable investment research. The dataset, created by Gunawan Wibisono and last updated in April 2026, was derived from a Scopus-screened corpus of academic literature. It models an integrative conceptual architecture organized around cognitive, structural, and bridge mechanisms.
5.5 KB of tabular data presents numerical simulation results for a convection-diffusion model used in plastic manufacturing. The dataset, created by Ahmed M. Abed, contains error metrics and outcomes from a mathematical poka-yoke simulator designed to reduce defects. It was last updated in April 2026.
Ahmed M. Abed created a 5.5 KB Excel dataset containing numerical simulation results for a convection-diffusion model in plastic manufacturing. The data includes tabular and graphical outcomes from a mathematical poka-yoke simulator, used to analyze defect causes. The dataset was last updated in April 2026.
A 9.5 KB Excel file contains pseudocode for the Mat-Poka-Yoke System (Mat-PYS), a control mechanism for plastic injection molding. The system was developed by Ahmed M. Abed and last updated in April 2026. It mathematically models convection-diffusion to reduce defects and improve machine efficiency.
This dataset is derived from Project Vaani, a large-scale multilingual speech initiative by IISc Bangalore and ARTPARK that captures India's linguistic diversity across all districts. It contains noise event timestamps and is still being built; the current release is a subset of a planned corpus of approximately 167 hours of training data. The dataset page was last updated on 2026-06-05.
Individual plant-level measurements of growth and yield-related traits for diploid (2x; 'Keleti1') and tetraploid (4x; 'Keleti1T') perennial rye genotypes. The dataset is 14.2 KB in size and was authored by Ahmed Ali Hamad, last updated on May 13, 2026. Missing data are indicated as 'NA' and values represent direct measurements or derived means per plant.
Primer sets and genomic data for characterizing the multiallelic mating-type loci in the edible oyster mushroom Pleurotus ostreatus. Yi-Yun Lee developed this resource, which includes analysis of 12 haplotypes identifying 11 A and 12 B alleles. The dataset was last updated in April 2026.
Several pyranometers collected solar radiation data for 3-4 consecutive days in jack pine (1994) and black spruce and aspen forests (1996). The BOREAS HYD-03 team used this array to test the hypothesis that energy transfer and snow water equivalent vary spatially with canopy closure. Data quality is noted as good due to generally clear days and daily maintenance of the radiometers.
Landsat TM data from 22-Jun-1984 to 30-Jul-1996 provides spatially extensive information for the BOREAS study areas. The imagery includes radiant energy, detailed land cover, and biophysical parameter maps such as FPAR and LAI. It primarily covers the Northern and Southern Study Areas (NSA/SSA) of the Boreal Ecosystem-Atmosphere Study.