Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,548 datasets
EarthDial-Dataset is a curated collection of 10,000 to 100,000 evaluation-only records for remote sensing and Earth observation, released by akshaydudhane and last updated in December 2024. It benchmarks vision-language models (VLMs) on real-world satellite and aerial imagery across tasks including classification, object detection, and change detection.
C3 is a cross-view cross-modality correspondence dataset containing 90,000 paired floor plans and photographs. It covers 597 scenes with 153 million pixel-level correspondences and 85,000 camera poses. The dataset was created by kwhuang and last updated on the platform in January 2026.
T2AV-Compass is a benchmark dataset created by NJU-LINK for evaluating Text-to-Audio-Video (T2AV) generation models. It targets unimodal quality, cross-modal alignment, complex instruction following, and perceptual realism. The dataset was last updated on December 25, 2025.
A Visual Question Answering dataset derived from the BD3 Building Defect Dataset. It pairs images of building surfaces with questions and defect category answers, designed for training and evaluating Vision-Language Models. The dataset was created by author 'chandrabhuma' and was last updated on December 27, 2025.
MLLM-Driven Synthetic Multimodal dataset (MDSM) is referenced in a research context titled "The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulate". The dataset likely contains synthetic multimodal data, potentially combining text and images. Its specific size, structure, and creation details are unknown.
A benchmark suite introduced in the paper 'Same or Not? Enhancing Visual Perception in Vision-Language Models'. It contains 12,000 challenging (image, question, answer) tuples emphasizing fine-grained image understanding. The dataset is composed of six sub-benchmarks and is hosted by glab-caltech.
A dataset aggregating Optical Coherence Tomography (OCT) scans and human expert annotations for hydrogel-treated wounds in a mouse model. It includes raw OCT scans and the corresponding tissue annotations.
Data Management and Sharing Plan for the POSE: Phase I research project, authored by Huaxiu Yao. It describes the scientific data to be generated and/or used in the research and outlines a strategy for managing and sharing project data. The specific data types, volume, and structure are not detailed.
AV-SpeakerBench is an audiovisual question-answering benchmark containing between 1,000 and 10,000 records, released in December 2024 by researcher plnguyen2908. It features trimmed segments across audio-only, visual-only, and audiovisual modalities paired with speaker-aware annotations to test fine-grained reasoning in multimodal models.
Synthetic face-iris dataset designed for multimodal biometric research and testing. The dataset's author, size, and specific creation details are not provided. Its last update date and licensing terms are also unknown.
CCTV-Pedestrian-1K is a dataset of high-angle surveillance pedestrian images intended for training Vision Transformers (ViT) and Vision-Language Models (VLM). The dataset is hosted on Kaggle and is tagged for applications in public safety and computer vision. Specific details on the number of images, collection time, and creator are not provided in the available metadata.
Multimodal cardiac data integrates electrocardiogram (ECG), photoplethysmogram (PPG), and cardiac timing features. The dataset is hosted on Kaggle and is associated with platform tags for biology, signal processing, and medicine. Specific details on size, origin, and update frequency are not provided in the available metadata.
Aligned text, image, and audio data for cross-language AI translation tasks in Traditional Chinese Medicine (TCM). The dataset is hosted on Kaggle and is tagged as suitable for beginners. Its author, organization, and specific size are unknown.
MGI-TED provides multimodal features for analyzing toddler development and learning behavior. The dataset's author, organization, and specific scale are currently unknown. It is hosted on Kaggle, but details on its collection method and temporal coverage are not provided.
NVIDIA released this collection of approximately 9 million vision-language samples in late 2025. It focuses on document understanding, visual question answering, and video-to-text tasks across multiple languages.
S-Chain is a multimodal medical dataset developed by Khai Le-Duc and a multi-institutional research team, last updated in December 2025. It provides structured visual chain-of-thought reasoning paths for clinical tasks across eight languages, including English, Arabic, and Japanese. The data supports a wide range of tasks from object detection to multilingual text generation.
WildfireVLM is a dataset hosted on Kaggle, likely focused on visual and language modeling for wildfire events. The platform tags suggest it contains geospatial and computer vision data, potentially for benchmarking deep learning models. Its specific content, size, and creation details require verification after download.
HAIM Multimodal Full Dataset is hosted on Kaggle. The dataset's specific content, size, and creation details are not provided in the available metadata. Its title suggests it contains multiple data modalities, likely for machine learning research.
Image and text question-answer pairs representing 90 distinct animal species. It provides structured data for Visual Question Answering (VQA) tasks, focusing on the identification and description of fauna.
A synthetic electronic health record dataset integrating text notes and time-series vital sign data. The dataset is designed for healthcare predictive research (HPR). It is published on Kaggle; its author, size, and last update date are not provided.