Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,558 datasets
More than 3,000 autonomous floats collect high-quality temperature and salinity measurements from the upper 2000 meters of the world's ice-free oceans. Each float completes approximately 150 cycles, surfacing every 10 days to transmit data via satellite. This dataset from Argo Australia and the Australian Ocean Data Network provides real-time observations of oceans surrounding Australia.
SciIR-82k is a large-scale dataset for Scientific Image Reasoning Generation, containing more than 80,000 high-quality scientific image-text pairs. The samples are derived from open-access scientific publications and enriched with structured reasoning annotations. The dataset was created by author 'contton-sss' and was last updated on June 20, 2026.
A dataset from figshare by Fahamida Akter, last updated in April 2026, containing 12.5 MB of files related to cold stress in rice. It includes phenotypic performance data for 38 rice genotypes and supporting images documenting artificial cold screening at seedling and reproductive stages. The data covers traits like leaf discoloration scores, survival rates, and cluster analysis of cold-related traits.
A 3.9 GB repository related to a biometric gait system publication. It contains files for minimal reproduction of experiments on the SIGNET data corpus and a notebook with results. The dataset was authored by Aleksander Sawicki and last updated on 2026-05-27.
Experimental data from an in vitro study evaluating three moisturizing pretreatments on reusable dental instruments. The dataset includes cleanliness scores, ATP values, and SEM/EDS analysis results for 30 surgical burs and 30 Nickel-Titanium files per type. Xiuyu Tang published the data on figshare in April 2026.
A 1.9 GB repository enabling minimal reproduction of experiments from the BUT data corpus, related to the publication 'Behavioral Biometrics in VR: Changing Sensor Signal Modalities'. It was authored by Aleksander Sawicki and last updated on 2026-05-27.
A public information asset registry from the Colombian Institute of Family Welfare (ICBF), created to comply with Law 1712 of 2014. The dataset likely contains metadata on information categories the entity generates, obtains, acquires, transforms, or controls. It was last updated on 2026-05-18 and is available via the www.datos.gov.co platform.
Released with the paper 'Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models' (arXiv:2606.03988). The dataset was authored by weikaih and last updated on Hugging Face on 2026-06-08.
A 15.2 KB dataset from figshare contains results from a study investigating dynamic alterations in gut microbiota following a 30-minute high-intensity treadmill run in BALB/c and C57BL/6 mice. Colonic content samples were collected at 0, 30, and 60 minutes post-exercise for 16S rRNA gene sequencing. The dataset, authored by Ruolin Gao and last updated in April 2026, shows strain-specific microbial changes and energy metabolism responses.
A study investigating dynamic alterations in gut microbiota following a 30-minute high-intensity treadmill run in BALB/c and C57BL/6 mice. Colonic content samples were collected at 0, 30, and 60 minutes post-exercise for 16S rRNA gene sequencing. The dataset, authored by Ruolin Gao and last updated in April 2026, is shared under a CC-BY-4.0 license.
Ruolin Gao's study on figshare, last updated April 22, 2026, investigates dynamic changes in gut microbiota following acute high-intensity exercise in BALB/c and C57BL/6 mouse strains. The dataset, 20.8 KB in size, includes results from 16S rRNA gene sequencing of colonic content samples collected at 0, 30, and 60 minutes post-exercise. It captures strain-specific microbial diversity and functional responses related to energy metabolism and gut integrity.
Observations from the Environmental Working Group Joint U.S.-Russian Arctic Sea Ice Atlas document Arctic sea ice conditions from 1950 to 1994. The atlas synthesizes data from satellites, ice stations, icebreakers, airborne surveys, and previously classified U.S. submarine missions from 1977-1993. It was developed through a collaborative U.S.-Russian partnership in the late 1990s and includes graphical ice charts, analysis methods, and climatological data.
Hudsongouge created a dataset of 6,820 procedurally generated supervised fine-tuning (SFT) examples, last updated on 2026-06-16. It is designed for training small reasoning agents with 1 to 3 billion parameters. The data aims to teach models to think before answering, use tools honestly, and refuse when evidence is missing.
A multimodal dataset from a qualitative interview study on the experience of beauty, using film clips as stimuli. The dataset includes video files of the stimuli, anonymized interview transcripts, visual bodily sensation maps, and analysis spreadsheets with thematic categories. It was created by Jakob Boer and last updated on June 8, 2026.
Thirty-two normal-hearing participants completed speech-on-speech listening tasks after implicit or explicit voice training. The study, conducted by Ada Bicer, measured speech intelligibility and pupil dilation responses at three target-to-masker ratios. Results were harvested into DataverseNL and last updated on June 8, 2026.
A document classification scheme reflecting the hierarchy of records produced by an institution. It corresponds to validated document retention tables for the entity, with columns indicating sections, subsections, series, and subseries. The dataset is hosted by www.datos.gov.co and was last updated on 2026-05-18.
EweBench is the first standardized benchmark for evaluating Large Language Models on the Ewe language, a Kwa language spoken by approximately 7 million people in Togo and Ghana. It is hosted on Hugging Face by the author 'jojonocode' and was last updated on 2026-06-24. The dataset serves as a reference for assessing model performance on this specific language.
Pre-tokenized .pt files containing packed GPT-2-tokenized sequences derived from the DCLM corpus. The dataset snapshots were curated by author zhiwei555 for the paper 'Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws'. They were last updated on June 8, 2026.
MathNet-Retrieve is a benchmark for math-aware information retrieval, created by ShadenA and last updated in June 2026. It contains 15,000 queries, each with a mathematically equivalent reformulation target provided at three difficulty tiers. The benchmark is designed to test retrieval systems on problems where the surface form is disguised while the underlying mathematical structure is preserved.
A dataset from the Plasma Science and Fusion Center Dataverse, authored by Efstratios Koukoutsis and colleagues, that documents a new dilation method for the quantum implementation of non-unitary operations. The method maps non-unitary operators to isomorphic unitary matrices using biorthogonal representations. It excels for operators with eigenvalues exceeding one in absolute value and is optimal for small-dimensional cases.