Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,539 datasets
furproxy provides a collection of captions for furry-themed images sourced from platforms like e621, CivitAI, and booru sites. The dataset contains approximately 7,500 captions, with at least 70% of the complex scenes being human-reviewed and edited. Captions were generated using Gemini 3 Flash and processed through a pipeline involving multi-crop passes and combination.
A subset of Google DeepMind's RoboVQA dataset, re-hosted for loader compatibility. Human-annotated long-horizon robotics video question-answering data across three embodiments, used to train the allenai/Molmo2-ER-4B model. The upstream dataset is described in the paper 'RoboVQA: Multimodal Long-Horizon Reasoning for Robotics' (arXiv:2311.00899).
CMDPAD challenges the static personality assumption by providing dynamic utterance-level scores for the Big Five personality traits. The dataset moves beyond emotion recognition to predict the emotional trajectory of the next interaction turn. It was authored by HensonXie and last updated on Hugging Face in May 2026.
Kun-Xiang created the PhysRL collection to accompany the SeePhys Pro research paper. The dataset includes the full PhysRL-38K corpus and a vision-necessary subset of PhysRL-8K, used for studying multimodal reasoning in physics. It was last updated on Hugging Face on 2026-05-13.
The DIM-Edit dataset accompanies the DIM-4.6B-T2I and DIM-4.6B-Edit models released in October 2025. The dataset supports research on rebalancing designer and painter roles in unified multimodal models for image editing. It was created by Ziyun Zeng, David Junhao Zhang, Wei Li, and Mike Zheng Shou, with the associated paper accepted to ICLR 2026.
Wiki-CoE is a multimodal question-answering dataset for evaluating visual reasoning and evidence localization. Each example pairs a natural-language question with one or more Wikipedia page screenshots, asking models to return both an answer and an explicit chain of supporting evidence. The dataset was created by PeiyangLiu and was last updated on the Hugging Face platform in May 2026.
ChartInt is a multimodal chart dataset designed for tasks such as chart reconstruction, editing, style transfer, interaction editing, and data updates. The dataset, created by xilinghuiye, contains 2,905 rows in its train split and was last updated on May 3, 2026. It is packaged as a datasets-compatible Parquet file for direct viewing on Hugging Face.
Youcheng Wang's 2026 study integrates survey data from 519 non-local visitors in Macao with street-level visual indicators. The multimodal analysis examines relationships among destination image, perceived value, perceived risk, satisfaction, attitude, and responsible tourism behavioral intention.
The Lost On Campus benchmark evaluates Embodied Scene Representation (ESR) of Vision-Language Models in large-scale real-world outdoor 3D environments reconstructed by 3D Gaussian Splatting. It introduces a unified reasoning-action evaluation framework integrating diagnostic QA and closed-loop interactive navigation under multimodal instructions. The dataset is authored by lost-on-campus-project and was last updated on 2026-05-07.
This dataset covers six urban areas in North Carolina impacted by Hurricanes Matthew and Florence. It provides binary flood extent annotations paired with building footprints and road networks, derived from high-resolution (1.5 cm to 25 cm) imagery. The data is structured into 10 spatial divisions and formatted for both deep learning model training and traditional GIS analysis.
A multimodal connectomic analysis of Amyotrophic Lateral Sclerosis integrates cortical thickness-based structural covariance networks, diffusion MRI tractography, and resting-state and task-based functional MRI. The study employs a 104-node parcellation scheme based on the Desikan-Killiany atlas to examine structure-function coupling and network reorganization in ALS patients versus matched controls.
A multimodal connectomic analysis of Amyotrophic Lateral Sclerosis integrates cortical thickness-based structural covariance networks, diffusion MRI tractography, and resting-state and task-based functional MRI. The study employs a 104-node parcellation scheme based on the Desikan-Killiany atlas to examine structure-function coupling in ALS patients and matched controls. It reports preserved global network topology but selective reorganization within motor and interhemispheric pathways.
A study of structural and functional brain connectivity in Amyotrophic Lateral Sclerosis (ALS) patients and matched controls. The analysis employed a 104-node brain parcellation scheme, integrating cortical thickness, diffusion MRI tractography, and resting-state and task-based functional MRI. Graph-theoretical metrics were derived to examine cross-modal structure-function correspondence.
A multimodal connectomic analysis integrating cortical thickness, diffusion MRI, and resting-state and task-based functional MRI from ALS patients and matched controls. The study employed a 104-node brain parcellation scheme and graph-theoretical metrics to analyze structure-function coupling. The dataset, authored by Vijay Renga and last updated in March 2026, is shared under a CC-BY-4.0 license.
MedHorizon is a long-context medical video benchmark created by DBD123 and last updated on 2026-05-07. It contains 340 full-procedure clinical videos paired with 1,253 multiple-choice question-answer pairs. The benchmark is designed to evaluate multimodal models on tasks requiring sparse evidence retrieval and multi-hop reasoning across long videos.
EmoRoad provides anonymized clip and raw data capturing psychological, physiological, and behavioral human-subject responses in varied driving conditions. The 3.3 GB dataset was created by RCFCM Hong Kong and released as open access in April 2026. It integrates multiple sensor modalities to study driver states.
MedHorizon provides 340 full-procedure clinical videos paired with 1,253 multiple-choice questions for evaluating multimodal AI models. The benchmark emphasizes two challenging properties: extremely sparse evidence retrieval and multi-hop reasoning across observations distributed throughout lengthy procedures. It was created by mlvbench-review and last updated on Hugging Face in May 2026.
A structured dataset of real-world VR forklift operation tasks, capturing aligned state, action, and outcome trajectories. It contains 384,950 timesteps at 50 Hz across 9 training episodes, created by fl-simulators and last updated on 2026-04-21. The data includes explicit intent, task structure, and reward signals for success, failure, and safety events.
Chong Liu authored a comparative analysis of multimodal fusion methods, published on figshare. The dataset is a 9.5 KB Excel file last updated on April 24, 2026.
15,282 behavioral annotations of LLM and VLM reasoning traces were collected by neulab. The dataset covers responses from 15 models across 6 benchmarks, with each row containing correctness and a JSON-encoded behavioral annotation. It was last updated on 2026-05-08.