Text classification, translation, QA, summarization, dialogue, sentiment analysis, language modeling, text corpora
43,991 datasets
Evaluation reports from Global Affairs Canada serve as a practical management tool for reviewing program performance. The collection includes reports for the Canadian International Development Agency's Regional Inter-American Program spanning the 2004-2005 to 2009-2010 fiscal years. Each report aims to improve the design and implementation of upcoming international development initiatives.
AISHELL8-RealScene is a multimodal dataset of conversational Mandarin speech recorded in real-world settings. It contains 102.19 hours of audio from 171 foreground speakers across 5 different locations. The dataset was created by SMIIP-lab and includes synchronized near-field and 8-channel far-field audio with multi-view facial video.
Raw data supporting figures and tables for the paper 'Trp521 oxidation affects FtsH2 stability and its role in PSII repair.' The 35.5 MB XLSX file was authored by ZHANG JINGZHI and last updated on May 30, 2026. It is shared under a CC-BY-4.0 license on figshare.
A study by Evania Fasya involving 102 participants with varied public speaking anxiety levels. Participants delivered speeches to virtual audiences of varying sizes and attitudes, with physiological signals, speech characteristics, subjective stress, and audience evaluations measured. The dataset was last updated on June 15, 2026.
GPT-5.5 Agent contains raw agent trace files generated by the teich platform from TeichAI. The dataset includes 89 JSONL files with metadata indicating the underlying model was GPT-5.5. It was uploaded by AletheiaResearch and last updated on June 23, 2026.
December 3–13, 2025 survey of generative AI users in South Korea aged 14–69 who used Gen AI at least once per week. The data comprises 400 valid survey responses collected via a stratified panel sampling procedure by Macromill Embrain, with IRB approval and participant consent. It was authored by Hongjin Shim and published on Harvard Dataverse.
Sijin Xia uploaded a research dataset on 2026-04-28 detailing the in vitro adaptation of Coxsackievirus A6 (CVA6) in Vero cells. The dataset likely contains phenotypic and transcriptomic comparisons between two recombinant virus strains, rV10 and rV45, generated from passage 10 and passage 45. The 2.0 MB file is a DOCX document describing viral growth, cytopathic effects, receptor interaction, pathogenicity, and host transcriptomic responses.
Experimental data from figshare evaluates chemical amendments for enhancing cadmium accumulation in sweet sorghum. The dataset, authored by Juan Li and last updated in April 2026, likely contains measurements from trials applying ferric chloride, citric acid, and polyaspartic acid. It supports the development of a practical strategy for phytoremediation combined with bioethanol production.
495 attack-level evaluation rows generated by the SEIGE framework from a local Ollama model sweep. The dataset includes attack prompts, model responses, pass/fail outcomes, risk scores, and metadata for security analysis. It was created by user 'tmesttttttttt' and last updated on June 5, 2026.
27 public-domain educational texts published before 1930 form this supervised fine-tuning dataset. The texts, sourced from the Internet Archive, span natural science, history, law, philosophy, and grammar, and are written in a question-and-answer catechism format. The dataset was created by zachnorton03 and last updated on June 19, 2026.
A curated collection of open-source datasets for distilling knowledge from Anthropic's Claude models. The repository contains at least two unified subsets, including 'claude-sonnet-4.6-120000x' with 119,446 samples and 'claude-opus-4.6-10000x' with 9,633 samples. The data was aggregated and formatted by ansulev, with credit to original creators, and was last updated on 2026-06-16.
Approximately 640,000 Shell scripting code samples compiled by ajibawa-2023, last updated on 2026-06-20. This large-scale corpus is stored in JSON Lines format and is designed to support research in code intelligence and automation. The dataset's primary purpose is to facilitate large language model pretraining and software engineering tasks.
McGill-NLP provides generated article and summary audio for English-centric multilingual directions. The dataset includes audio files and metadata for language pairs such as Amharic-English, Arabic-English, Bengali-English, and Chinese-English. It was last updated on June 17, 2026.
AURORA study data from 385 men and women tracks C-reactive protein levels and pain severity after traumatic stress exposure. Lauren A. McKibben's research, last updated in 2026, reveals a sex-dependent relationship where peritraumatic CRP predicts chronic pain outcomes in men but not women. The dataset likely contains longitudinal biomarker measurements and pain questionnaire responses collected from emergency department patients.
A PDF document explores the Japanese concept of 'ma' and its influence on fashion design. The work by Vivien Jiaqian Zhu connects Greek philosophy, avant-garde Japanese designers, the Paris fashion scene, and Chinese classics like The Dream of the Red Chamber. It was last updated on 2026-05-16 and is licensed under CC-BY-4.0.
A collection of CRef/SRef LoRA triplets exported from the 0426 diffusion training data. Each training example contains three images: a content reference, a style reference, and a target image generated from the combined condition. The dataset was created by Blue2Giant and last updated on June 17, 2026.
54 full-length songs were dynamically rated for valence and arousal by listeners recruited via Amazon Mechanical Turk. The MERP dataset was created by amaai-lab for music emotion recognition research. The dataset page was last updated on 2026-06-20.
A curated collection of publicly available geospatial datasets from the Geospatial Information Center that specify a Creative Commons license. The data includes 3D city and building models, elevation and terrain data, vector datasets for administrative boundaries and land use, and disaster-related geospatial information. The collection is hosted on AWS S3 and aggregated by the organization AIGID.
INDICADORES RESOLUCIÓN 1522 DE 2013 is a quarterly report of indicators from Resolution 1552, measuring appointment opportunity for general medicine, dentistry, and specialized medicine in Colombia. The dataset is hosted on the Colombian open data portal www.datos.gov.co and was last updated on 2026-05-27. It contains columns such as MAX DIAS ESPERA, MIN DIAS ESPERA, HORAS PROMEDIO, and PERIODO.
Qualitative data from focus groups and key informants examines women's attitudes towards oral pre-exposure prophylaxis (PrEP) for HIV prevention. The 650.8 KB dataset, authored by Grace Kenyonga and last updated in May 2026, contains manually analyzed thematic findings. The women's views on PrEP range from empowerment through control over their own health to concerns about promiscuity and partner conflict.