Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,541 datasets
This dataset provides 29,283 pairwise human preference labels comparing human motion quality across four frontier video generation models. Released by Datapoint AI in February 2026, it captures evaluations from 4,349 unique annotators across three quality dimensions of AI-generated video.
Image-caption-project is a dataset from Kaggle. Its title suggests it contains pairs of images and textual descriptions. The dataset's specific scale, origin, and update date are unknown.
Kaggle hosts this processed dataset derived from the China Physiological Signal Challenge 2018 (CPSC2018). The title indicates it contains electrocardiogram (ECG) data that has been processed and is multimodal in nature. The original CPSC2018 challenge focused on ECG signal classification and analysis.
Kaggle hosts a dataset titled 'animals10-10k-image-caption-dataset'. The dataset likely contains 10,000 images of animals paired with descriptive text captions. Its specific source, creation date, and author are unknown from the provided metadata.
VLMSafe-420 consists of 420 multimodal counterfactual pairs across 38 safety categories, developed by ArthT and updated in March 2026. The data is designed for mechanistic interpretability research to identify and analyze safety circuits within Vision-Language Models.
DeepVision-103K contains 103,000 multimodal records focused on verifiable mathematical reasoning, released by skylenage in February 2026. It utilizes image-text pairs to improve the efficiency of vision-language models in solving complex logic problems.
Nemotron RL Instruction Following Calendar V2 is a multi-turn conversation dataset for understanding natural-language scheduling constraints and inferring conflicts. It contains events with specific duration and timing constraints that are mentioned in random order over the course of a conversation. The dataset was created by NVIDIA and last updated in March 2026.
Flickr8k Tamil Image Caption Dataset provides Tamil language captions for images, intended for image captioning and vision-language research. The dataset's author, organization, size, and update history are not specified in the provided metadata. It is hosted on the Kaggle platform.
MM-IMDb combines visual and textual data for movies, likely sourced from IMDb. The dataset is designed for multi-label genre classification tasks. Its author, organization, and exact size are unknown.
DGM4MultiModalDeepFake is a dataset hosted on Kaggle. The dataset's title suggests it contains multimodal data likely intended for deepfake detection research. The specific content, size, and origin are not detailed in the provided metadata.
PersonaVLM is a dataset supporting the development of personalized multimodal agents with long-term memory capabilities. The framework was created by ClareNie and the associated paper was accepted for CVPR 2026. The dataset is hosted on Hugging Face and was last updated in March 2026.
BLIP-Base is a pre-trained model for vision-language understanding tasks, hosted on Kaggle. The provided metadata does not detail the training corpus or the number of image-text pairs involved. It is likely intended for practitioners working with multimodal data.
This Kaggle upload appears to contain the final weights of a fine-tuned LLaVA (Large Language-and-Vision Assistant) model, likely produced with LoRA (Low-Rank Adaptation). Its specific content, size, and creation details are not provided in the available metadata. The title suggests it holds parameters for a vision-language model, potentially for tasks such as image captioning or visual question answering.
LLaVA-LoRA-Oracle-Final appears to be a dataset for fine-tuning multimodal large language models. The title suggests an association with the LLaVA (Large Language-and-Vision Assistant) project and the use of LoRA (Low-Rank Adaptation). It is published on Kaggle, and its specific content and scale need to be verified after download.
Zsfood Vlm Des is a dataset published on Hugging Face by author LTaiQin. The title suggests it contains food-related data, likely for vision-language model tasks. The dataset was last updated on April 21, 2026.
This dataset comprises 1,235,432 Midjourney v6 images paired with captions generated by three different Vision Language Models (VLMs), and was released by Photoroom in March 2026. It provides a large-scale collection of AI-generated art with multi-perspective textual descriptions from LLaVA, Gemini Flash 1.5, and Qwen3 VL 8B. The data is formatted in Parquet for efficient processing in machine learning workflows.
This dataset is a 50,000-image subset of the YFCC100M collection, intended for multimodal retrieval and vision-language research. Its creation date, author, and update frequency are unknown.
FAMMA appears to be a multimodal dataset, likely containing multiple data types such as images, text, or audio. The dataset is hosted on Kaggle, but its specific contents, size, and creation details are not provided in the available metadata. Its origin, author, and the time period it covers are currently unknown.
OmniVideo-R1 provides between 100,000 and 1,000,000 preprocessed records for audio-visual reasoning, published by jankin123 in March 2026. The collection supports a two-stage training framework for multimodal models, specifically focusing on Query-Intensive (QI) grounding and modality attention.
Open-Personix is a person-centered multimodal dataset of fewer than 1,000 records maintained by Poralus and updated in March 2026. It provides structured JSON entries containing relative image paths, natural-language captions, and descriptive person-specific annotations.
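As a rough illustration of the entry structure described for Open-Personix, the sketch below parses JSON records and resolves their relative image paths against a dataset root. The field names (`image`, `caption`, `person`), the sample records, and the `open_personix` root directory are assumptions made for this example and are not taken from the dataset's actual schema.

```python
import json
from pathlib import Path

# Hypothetical sample records; the real field names may differ.
SAMPLE = """[
  {"image": "images/0001.jpg",
   "caption": "A person reading on a bench.",
   "person": {"clothing": "red jacket"}},
  {"image": "images/0002.jpg",
   "caption": "A person cycling.",
   "person": {"clothing": "helmet"}}
]"""


def load_entries(raw_json, dataset_root):
    """Parse JSON records and resolve each relative image path against dataset_root."""
    root = Path(dataset_root)
    entries = []
    for record in json.loads(raw_json):
        entries.append({
            "image_path": root / record["image"],   # relative path joined to the root
            "caption": record["caption"],
            "person": record.get("person", {}),     # annotations may be absent
        })
    return entries


entries = load_entries(SAMPLE, "open_personix")
for entry in entries:
    print(entry["image_path"].as_posix(), "|", entry["caption"])
```

Resolving paths at load time keeps the records portable: the same JSON can be used wherever the dataset is unpacked by changing only the root argument.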