Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
OmniBrainBench is a multimodal benchmark dataset for brain imaging analysis across multi-stage clinical tasks. The dataset was created by FrankPN and is associated with a CVPR 2026 paper. Row count, column count, and data size are not specified in the available metadata.
VLM4D is a benchmark of approximately 1,000 real-world and synthetic videos designed to evaluate spatiotemporal reasoning in Vision Language Models. Developed by Shijie Zhou and researchers at UCLA in 2025, the dataset provides curated video-text pairs to test model awareness of motion and time.
Mdpbench Vlmevalkit is a dataset published on Hugging Face by Delores-Lin. It was last updated on April 13, 2026. The dataset's title suggests it is a benchmark for evaluating vision-language models.
Human Behavior Atlas (HBA) is a multimodal benchmark aggregating between 100,000 and 1,000,000 records for psychological and social behavior analysis, published by keentomato. It standardizes diverse behavioral datasets into a single framework for training foundation models on signals like emotion, intent, and sarcasm. The collection spans text, audio, image, and video modalities to support social intelligence tasks.
VLMData is a dataset published on Kaggle, likely containing data for training or evaluating Vision-Language Models. The dataset's specific content, size, and origin are not detailed in the available metadata. Its structure and intended use must be verified after download.
Joy Captioning 20250408A contains between 100,000 and 1,000,000 image-text pairs used for the initial training of the JoyCaption Beta One vision-language model. Created by fancyfeast and updated in early 2026, the collection focuses on detailed image descriptions and visual question-answering tasks. The data includes a mix of human-written and machine-generated text, explicitly labeled for provenance.
1,000 real-world iOS mobile UI screens collected from diverse application categories on the Apple App Store. Each screen is paired with human-validated structured JSON ground truth annotations, enabling research in UI understanding and layout analysis. The dataset was created by atharparvezce and last updated on Hugging Face in February 2026.
Stephengzk published this dataset on Hugging Face on April 4, 2026. The title suggests it contains YouTube videos, likely short-form content, associated with a visual question-answering (FVQA) task. The dataset's specific content, scale, and structure require verification after download due to minimal provided metadata.
RoboInter-VQA is a Visual Question Answering dataset for robotic manipulation, developed as part of the RoboInter project. It covers generation and understanding of Intermediate Representations for task planning and is built on annotations from RoboInter-Data, with raw robot datasets sourced from DROID and RH20T. The dataset was created by InternRobotics and was last updated on February 14, 2026.
STVQA-7K is a high-quality spatial visual question answering dataset comprising 7,587 samples. It was created by hunarbatra and last updated on 2026-01-29. The dataset is fully grounded in human-annotated scene graphs from Visual Genome and is designed for training and evaluating spatial reasoning capabilities in multimodal large language models.
VLM Dynamic Model Information is a dataset related to vision-language models and their evaluation, published on the Hugging Face platform. The dataset was created by 'open-cn-llm-leaderboard' and was last updated on April 3, 2026. The specific content, scale, and structure require verification after download as metadata is minimal.
PyVision-Image-RL-Data provides between 10,000 and 100,000 reinforcement learning trajectories for training agentic vision models, released by Agents-X in February 2026. The data supports the PyVision-RL framework, which focuses on stabilizing multimodal model interactions during complex vision-language tasks.
A multimodal benchmark dataset of aligned Hematoxylin and Eosin (H&E) stained tissue patches and gene expression profiles for breast tissue, published by theislab in 2026. The dataset is designed for research at the intersection of computational pathology and genomics. It was last updated on Hugging Face in February 2026.
Multimodal Phishing Detection Dataset is a Kaggle-hosted collection for cybersecurity research. The dataset likely contains features for identifying phishing attempts across different data modalities. Its specific content, size, and creation details require verification after download.
110,000 training and 31,000 testing question-answer pairs for multimodal LLM-based autonomous driving. The dataset includes nine types of tasks across occlusion-aware perception, planning-aware prediction, and V2V-aware planning stages. It was created by eddyhkchiu and last updated on Hugging Face in February 2026.
CAMEO-Lung is a multimodal benchmark dataset containing aligned histopathology images and gene expression profiles from lung tissue. The dataset was created by theislab and last updated on Hugging Face in February 2026. It is intended for research in spatial transcriptomics and computational pathology.
MedMASLab is a benchmarking dataset for medical vision-language multi-agent systems released by qyhhhhh in March 2026. It provides standardized data and metrics for evaluating how multiple AI agents collaborate on medical visual question answering tasks. The dataset is associated with arXiv paper 2603.09909 and focuses on multi-agent coordination in clinical AI.
CAMEO-Thymus is a multimodal benchmark dataset of thymus tissue. It contains aligned patches of Hematoxylin and Eosin (H&E) stained histology images and Visium spatial gene expression profiles. The dataset was created by theislab and was last updated on February 27.
Digital Media Art and Visual Communication Dataset is a multimodal dataset for creative visuals, published on Kaggle. The dataset likely contains various forms of visual and communication data for analysis. Specific details on size, authorship, and update frequency are not provided in the available metadata.
5,122 training examples of Armenian dialect speech recordings from Artsakh varieties (Stepanakert, Getashen, Hadrut) paired with aligned transcriptions, split into train, validation, and test subsets. The dataset was created by DALiH-ANR and last updated on February 17, 2026.