Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,539 datasets
EmoRoad is a multimodal dataset containing psychological, physiological, and behavioral responses collected from human subjects in diverse driving scenarios. The dataset includes identifiable facial video recordings, EEG signals, eye tracking, and other measures, requiring a signed Data Usage Agreement for controlled access. It was published by RCFCM Hong Kong and last updated in April 2026.
EmoRoad is a multimodal dataset containing psychological, physiological, and behavioral data from human subjects in diverse driving scenarios. The dataset includes identifiable facial video recordings, EEG signals, and other measures, requiring a Data Usage Agreement for access. It was created by Stephen Jia Wang and made available in 2026.
A subset of 20,000 video question-answer pairs from the LLaVA-Video-178K dataset, hosted by internlm. The dataset was last updated on 2026-05-22. It provides relative file paths to video clips, likely intended for training or evaluating multimodal AI models.
53,202 instruction-tuning examples were curated by the Trendyol Security Team for training defensive security AI assistants. The dataset covers over 200 specialized cybersecurity domains, including cloud-native threats, AI/ML security, and quantum computing risks. It was expanded from an earlier version of 21,000 rows and last updated on May 17, 2026.
A curated set of 10,000 high-resolution image-caption pairs for training and research. Images were sourced from Pexels, and captions were generated with JoyCaption before being cleaned for use. The dataset was created by edwixx and was last updated on 2026-05-16.
Autobusu Stoteles is a multimodal dataset containing 102 PNG screen images of bus stops. Each image has a detailed Lithuanian caption, and the dataset is likely intended for vision-language model tasks. It was created by dzeveckij and last updated on May 20, 2026.
A multimodal dataset for material recognition, likely containing images and spectral data of fibers. Visual images were acquired with a field-emission scanning electron microscope (FESEM), Raman spectra with a Raman spectrometer, and near-infrared spectra with an NIR spectrometer. The dataset is 179.2 MB in size, authored by Weiqin Zhu, and was last updated on 2026-05-01.
India-Centric Image–Text Pairs Dataset is a multilingual collection of document images paired with OCR transcriptions. It includes samples in 22 Indian languages, including Bengali, Hindi, Kannada, Malayalam, Marathi, Sanskrit, Tamil, and Telugu. The dataset was created by MILA: MULTILINGUAL INDIC LANGUAGE ARCHIVE and last updated on 2026-05-07.
More than 2,000 hours of multimodal human sensorimotor data are collected each week for this dataset, which is described as the largest of its kind. The dataset is produced by Human Archive, a project backed by Y Combinator and by engineers from OpenAI, BAIR, SAIL, and other organizations. The dataset page was last updated on 2026-05-18.
NSD-VQA is a large-scale visual question answering benchmark for studying the decoding of visual and semantic information from human fMRI responses to natural images. It is built from the Natural Scenes Dataset (NSD) and provides automatically generated question-answer annotations grounded in NSD images. The dataset was created by mcosarinsky and was last updated on 2026-05-24.
Posttraining-RFM-RSS2026 provides real-robot bimanual manipulation trajectories for three benchmark tasks. Data was collected on a bimanual YAM follower teleoperated by a GELLO leader arm, with timestamp-aligned frames across joint state and action. The dataset was released for the RSS 2026 Workshop & Challenge on Post-training for Robotics Foundation Models.
PRISM Public SFT Data is a collection of public multimodal demonstrations used for supervised fine-tuning in the PRISM project. The project studies distributional drift in the post-training pipeline for large multimodal models. This dataset serves as the broad SFT initialization source before distribution alignment and RLVR stages.
Yekai Wang's study integrates cardiopulmonary, neuromuscular, and biomechanical data from 20 healthy collegiate male athletes performing high-intensity treadmill exercise. The dataset includes gas exchange, heart rate, perceived exertion, sEMG metrics from four leg muscles, and plantar kinetic measures from in-shoe sensors. It provides a structured framework for analyzing fatigue-related adaptations beyond metabolic indicators.
MQUD contains 1,250 figure-grounded inquisitive questions sourced from scientific papers. Each example pairs a scientific figure with paper context, a question, an extractive answer, and a question type. The dataset was created by lingchensanwen and was last updated on Hugging Face in May 2026.
DepthVLM-Bench is a unified metric depth estimation benchmark for vision-language models. It provides diverse indoor and outdoor scenes with metric depth annotations in a VLM-compatible format. The dataset was created by JonnyYu828 and was last updated on May 18, 2026.
318,000 agent trajectories for instruction tuning of large language models in software engineering. The dataset was synthesized using the Qwen3-Coder-480B-A35B-Instruct model and collected via the OpenHands framework. NVIDIA authored the dataset, which was last updated on May 5, 2026.
Twenty-six subjects performed breath-holding, paced-breathing, and mild hypercapnia tasks while wearing a low-cost multimodal wearable. Thien Nguyen collected this 53.3 MB dataset, which is stored in MAT files and was last updated in May 2026. The data is intended to support research on monitoring vital signs and tissue oxygen saturation.
A JSON-LD knowledge graph encoding the concept layer of the Agent Knowledge Cycle (AKC), a six-phase bidirectional growth loop for agent behavior and operator judgment. The dataset is a mirror of the graph.jsonld file from the AKC GitHub repository, provided for LLM training pipelines. It was uploaded by Shimo4228 and last updated on 2026-05-18.
GuideDog is a real-world egocentric multimodal dataset for accessibility-aware guidance for blind and low-vision users. It contains 22,084 image-description pairs, including 2,106 human-verified gold and 19,978 VLM-generated silver annotations, collected from real walking videos across diverse cities. The dataset accompanies an ACL 2026 paper and includes derived multiple-choice subsets.
A research dataset from Harvard Dataverse, last updated 2026-05-26, aiming to improve myofascial pain management. The project, led by Siddhartha Sikdar, develops imaging biomarkers to distinguish healthy from diseased soft tissues such as muscle, connective tissue, nerves, and blood vessels. It compares tissue changes in individuals with myofascial pain to those in individuals without pain.