Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
44,799 datasets
Australian Indigenous communities adjacent to Offshore Renewable Energy (ORE) wind farm development areas are the focus of this desktop study. The work compiled information on cultural values, Sea Country plans, Indigenous Cultural Intellectual Property, and preferred engagement methods. The raw spreadsheet is withheld due to cultural sensitivities, but a synthesis is available in the NESP MaC Project 3.3 final report.
Raw participant ratings and individual dissimilarity matrices analyzed for a 2026 manuscript on measuring qualia diversity. The dataset includes CSV and H5 files totaling 1.1 MB. Kyoko Kusano and colleagues created this data to apply category-theoretic indices to psychophysical experimental results.
A 2009 Geoscience Australia report on Economic Demonstrated Resources (EDR), which increased in 2008 for 18 mineral commodities, including black coal and iron ore, and decreased for nine others. The report provides world rankings, showing that Australia's resources of brown coal, nickel, and uranium are the world's largest, and analyzes resource life estimates for major commodities. It also discusses exploration expenditure trends for the 2008 calendar year.
10,000 reasoning traces from the hardest OCR2 questions, aggregated from multiple AI models. The dataset was created by JingweiNi and last updated on May 30, 2026. Each row contains step-level labels from Qwen3.5-122B and GPT-5.5 models, formatted as aligned arrays.
Geoscience Australia's 2010 report provides estimates of the country's identified mineral resources as of December 2009 for major and minor commodities. These long-term resource estimates are compared with short-to-medium term industry ore reserves and include mine production data from the Australian Bureau of Agricultural and Resource Economics and Sciences. The report also analyzes mineral exploration expenditures for 2008-09 and 2009, presenting trends and Australia's world ranking based on United States Geological Survey information.
A 2026 dataset by ShushengYang contains 500 question-answer pairs for evaluating multimodal AI models. It is a short-video companion to VSTAT, featuring 450 synthetic video clips trimmed to approximately 5 seconds each. The dataset is packaged for use with the lmms-eval framework.
Raw evaluation metrics, execution telemetry logs, and structural syntax outputs from running the Mostly Basic Python Problems (MBPP) benchmark against the StarCoder 15B base model. This partition documents scaling limits of unaligned foundational weights in conversational benchmarking loops. The dataset was authored by ShahzebKhoso and last updated on 2026-05-28.
403 papers from a scoping literature review on cold freshwater fish bioenergetics, compiled by Connor Reeve. The dataset includes two files: one containing extracted data from the reviewed papers and another detailing models from the Fish Bioenergetics 4.0 software. It was last updated on April 28, 2026.
A dataset of 3300 technical reasoning traces generated by the Kimi K2.6 teacher model. It was designed as an add-on for downstream supervised fine-tuning experiments, focusing on math, graduate-level science, coding, and debugging prompts. The dataset was authored by trjxter and last updated on June 3, 2026.
Shu Zhang's dataset on figshare contains biomechanical data from 23 youth weightlifters aged 15–18 performing snatch lifts at 70%, 80%, and 90% of their 1RM. Data includes inertial motion capture and EMG recordings, with deep muscle forces and joint loads calculated using OpenSim. The dataset was last updated on April 14, 2026.
Australia's offshore mineral occurrences and deposits within its 200-nautical-mile exclusive economic zone and extended continental shelf. The map draws together data from published and unpublished marine research surveys and government records, showing resources like manganese nodules, heavy mineral sand, and diamonds. It was produced collaboratively by Geoscience Australia, CSIRO, and state and territory geological surveys.
Recorded state and municipal offenses from the AEGIS records management system of the Providence Police. The data is published by data.providenceri.gov and was last updated on April 3, 2026. A single case can contain multiple offenses, and the log excludes certain sensitive cases to protect victims and juveniles.
StarCoder2 3B base model evaluation on the Mostly Basic Python Problems (MBPP) benchmark. The dataset contains raw evaluation metrics, execution telemetry logs, and structural syntax outputs captured from automated conversational pipelines. It was authored by ShahzebKhoso and last updated on May 28, 2026.
Raw data for a line chart visualizing a loss function, as referenced in Figure 7 of a published research article. The dataset was authored by Ruishi Liang and published on figshare in May 2026. It is a small file of 28.1 KB.
Molecular, chemical, and morphological data for the seaweed species Eucheumatopsis isiformis, collected from March to November 2022 in Yucatán, Mexico, with a comparison specimen from Florida, USA. The dataset includes gene sequencing for haplotype construction, carrageenan yield and sulfate content measurements, and morphological characterizations. It was authored by Monserrat López-Yllescas and is available under a CC-BY-4.0 license.
A mapping table linking Common Terminology Criteria for Adverse Events (CTCAE) codes for side effects to corresponding items in Quality of Life (QoL) questionnaires, specifically the EORTC QLQ-C30. The dataset was authored by Maria-Angeles Fuentes-Expósito and last updated on May 13, 2026. All questionnaire results referenced are from a single time point, T=12.
ADQA-Bench is the official evaluation set for the DCASE 2026 Challenge Task 5: Audio-Dependent Question Answering. It focuses on addressing textual hallucination in Large Audio-Language Models by requiring models to answer questions based on audio perception rather than linguistic priors. The dataset was authored by Harland and last updated on May 29, 2026.
Noah Atkin from Imperial College London conducted a study on nest predation in a mixed deciduous forest fragment bordering open grassland. Artificial nests containing quail and plasticine eggs were placed at ground and arboreal levels to test the edge effect hypothesis. The dataset likely contains records of predation events and nest locations.
119,877 prompt-response examples for supervised fine-tuning of Turkish language models. The dataset was created by nafie-ai and focuses on rule-based reasoning, text-grounded question answering, and safe handling of toxic inputs. It was last updated on June 2, 2026.
486 Tamil textbooks containing 20.48 million words, designed to support NLP development. The dataset is part of a larger multilingual educational corpus with over 2.6 billion words across 5,000+ subjects, created by InfoBayAI and last updated in June 2026.