Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
A dataset likely containing images of plants paired with questions and answers, focusing on disease identification. It is hosted on Kaggle, but the specific collection date, author, and total volume are unknown. The dataset's name suggests it is a second version of a visual question-answering resource for the PlantVillage domain.
A dataset titled 'multimodal_concise' is hosted on Kaggle. The dataset's content and structure are inferred from its title, which suggests it contains multiple data modalities in a concise format. No further metadata is available to confirm its size, origin, or specific contents.
SIQA TrainSet provides between 100,000 and 1,000,000 records for Scientific Image Quality Assessment, released by the SIQA organization in early 2026. The data is structured for multimodal training across two specialized tasks: SIQA-U for subjective assessment and SIQA-S for structural assessment.
The Dendritic Consortium provides a multimodal dataset integrating calcium and voltage imaging, electrophysiology, electron microscopy, proteomics, and computational models. It focuses on Baz1a pyramidal neurons in the mouse primary visual cortex (V1). The data is hosted on AWS Open Data with no restrictions on use.
A dataset named 'vqacache' published on Kaggle. The title suggests it is related to Visual Question Answering, a multimodal AI task combining images and text. No further metadata, such as size, columns, or authorship, is provided.
Geoguessr Vlm Dataset is a multimodal dataset hosted on Hugging Face by author OLEGator228. The platform tags suggest it contains satellite imagery and location data for vision-language model tasks. It was last updated on 2026-04-08.
CASTLE2024 provides time-aligned sensor and video data captured from 10 participants over four days in a controlled environment. The dataset, created by the CASTLE-Dataset organization, is designed for research in lifelogging and multimodal retrieval. It was last updated on the Hugging Face platform in February 2026.
A dataset titled 'English Nemotron Sft Instruction Following Chat V2 Pretrain' was published on the Hugging Face platform by the user 'escavador'. The dataset's specific content, size, and structure are not detailed in the provided metadata. Its last recorded update was on 2026-03-31 15:47:30.
IshiharaColorBench is a benchmark designed to measure the pure color perception capability of Large Vision-Language Models (LVLMs). The dataset's specific size, format, and authorship are unknown. It is hosted on Kaggle.
VisualWebInstruct is a large-scale multimodal instruction dataset containing approximately 900,000 question-answer pairs. It consists of 40% visual QA pairs linked to 163,743 unique images and 60% text-only QA pairs, designed to enhance vision-language reasoning. The dataset was created by TIGER-Lab and was last updated on February 1, 2026.
A dataset of 51,000 Kazakh question-answer pairs designed for instruction tuning of large language models. It covers more than 20 domains, making it suitable for building general-purpose Kazakh language AI assistants. It is hosted on Kaggle and is formatted for immediate use in fine-tuning tasks.
Drishti-VLM-Data is a dataset published on Kaggle. The title suggests it contains data for training or evaluating vision-language models. The dataset's specific content, size, and origin are not detailed in the available metadata.
ViFoodVQA is a benchmark dataset for visual question answering tasks. The dataset likely contains images of Vietnamese food paired with questions and answers. It is hosted on Kaggle, but details about its size, creation, and update history are unknown.
A dataset associated with the ANN-2026 multimodal challenge, likely containing information related to crowdfunding campaigns and their outcomes. The dataset is hosted on Kaggle, but its specific contents, scale, and origin are not detailed in the available metadata. Further inspection after download is required to confirm the data's structure and features.
KitaKo Multimodal Dataset contains 110,000 images paired with 548,000 parallel captions. The captions are provided in three languages: English, Filipino, and Taglish. The dataset's author, organization, and last update date are unknown.
A dataset associated with the ICCV 2025 paper 'SENTINEL: Mitigating Object Hallucinations via Sentence-Level Early Intervention'. The dataset was created by author psp-dada and last updated on February 11, 2026. It is designed to address the problem of fabricated content in multimodal large language models.
A Kaggle-hosted resource described as a version of the BLIP model fine-tuned on the COCO dataset, likely accompanied by image-text pairs for action captioning tasks. Its specific size, columns, and creation details are unknown. Its content and scale require verification after download.
A dataset titled 'internvl24b-vlm-cia' is hosted on Kaggle. The name suggests it is likely a multimodal dataset for training or evaluating vision-language models. No further metadata is available to confirm its size, origin, or specific content.
Resized-Kvsair VQA is a dataset for visual question answering tasks, likely containing pairs of images and corresponding questions. It is hosted on Kaggle. The dataset's specific content, size, and creation details are not provided in the available metadata.
VLM Data is a dataset hosted on Hugging Face by author INV-WZQ. The dataset was last updated on April 1, 2026. Its specific content and scale are not detailed in the available metadata.