Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
WorldVQA is a benchmark dataset created by MoonshotAI to evaluate atomic vision-centric world knowledge in Multimodal Large Language Models (MLLMs). It was last updated in February 2026. The dataset decouples visual knowledge retrieval from reasoning to provide a strict measurement of a model's fundamental world knowledge.
A longitudinal and multimodal benchmark for robust drift detection in Android malware. The dataset is hosted on Kaggle, but specific details on its size, creation date, and authorship are not provided in the available metadata. Its primary purpose is to serve as a testbed for evaluating the robustness of machine learning models against concept drift in the malware domain.
Testing-multimodal is a dataset published on Kaggle. The title suggests it is intended for evaluating machine learning models that process multiple data types. The dataset's specific content, size, and origin are not detailed in the available metadata.
A dataset of 122,000 visual question-answer pairs covering more than 145 microscopy genera. Each pair likely links an image to a textual question and answer for visual question answering tasks. Published on Kaggle.
The dataset was released for the ImageCLEF 2019 challenge in September 2019. It is a medical visual question answering dataset created by researchers including Asma Ben Abacha from Microsoft. The dataset likely contains medical images paired with corresponding questions and answers.
VQA-Med is a dataset for medical visual question answering, introduced as a task in the ImageCLEF 2019 evaluation campaign. The dataset was created by researchers including Asma Ben Abacha from Microsoft and was published in the CLEF 2019 Working Notes. It is designed to benchmark AI systems on answering questions about medical images.
Unsloth provides UV scripts for fine-tuning Large Language Models (LLMs) and Vision-Language Models (VLMs) using on-demand cloud GPUs via Hugging Face Jobs. The scripts handle dependency installation automatically, enabling direct execution without local setup. The dataset was last updated on February 11, 2026.
A multimodal dataset for footwear recommendation, likely containing user interaction and product preference data. It is hosted on Kaggle, but specific details about its size, structure, and creation date are not provided in the available metadata. The dataset's content and scale require verification after download.
The dataset contains pre-processed 3D assets, including voxels, rendered images, and LLM-annotated material descriptions. It combines four individual 3D asset datasets, each processed for multi-view rendering and voxelization. Created by NVIDIA, it is intended for research on predicting volumetric mechanical properties.
A dataset titled 'viNumMultimodalData' published on Kaggle. The title suggests it contains multiple data modalities, such as images and text. Specific details on size, origin, and creation date are unavailable.
VQA-VNF-Dataset is a Kaggle-hosted collection for visual question answering research. The dataset likely contains paired images and questions, potentially augmented with additional non-visual features. Its specific scale, authorship, and update history are not detailed in the provided metadata.
A dataset published on Kaggle with the title 'BLIP-2-HUY-UNIVL'. The title suggests it is related to the BLIP-2 model, a vision-language pre-training framework. The specific content, size, and origin are unknown from the provided metadata.
Image Caption is a dataset likely containing pairs of images and descriptive text. The dataset is hosted on Kaggle, but its specific size, source, and creation date are unknown. Columns and sample data are unavailable, limiting detailed assessment of its content and structure.
LongTVQA contains between 100,000 and 1,000,000 question-answer pairs, along with clip-level subtitles, for long-form video analysis. Released by longvideoagent in late 2025 (arXiv:2512.20618), the dataset facilitates research into video-grounded dialogue and temporal retrieval.
DiverseVQA2 is a dataset hosted on Kaggle. Its title suggests it is a collection for visual question answering tasks, likely containing pairs of images and associated questions with answers. The dataset's specific size, source, and creation date are not provided in the available metadata.
A dataset likely associated with the Brain Tumor Segmentation (BraTS) challenge, focusing on multimodal MRI scans. The title suggests it contains code and potentially data for processing brain tumor images, but specific details on volume, origin, and update date are unavailable. It is hosted on the Kaggle platform.
A dataset of 64,765 pages from Vietnamese textbooks for grades 1 to 12, annotated for Visual Question Answering (VQA). It was created by 5CD-AI and last updated on February 3, 2026. The description mentions a set of 388,277 detailed annotations.
parking-management-system-with-vlms-data is a dataset hosted on Kaggle. The title suggests it contains data for a parking management system built with vision-language models (VLMs). Because the provided metadata is minimal, the dataset's specific content, size, and origin require verification after download.
Innovator-VL-RL-172K is a curated multimodal reinforcement learning dataset containing 172,000 instances released by InnovatorLab in 2026. It provides image-text reasoning pairs designed to support RLHF-style optimization for vision-language models.
A dataset titled 'CacheVQA' published on Kaggle. The name suggests it is likely a collection of image-question-answer pairs for training and evaluating Visual Question Answering models. The dataset's specific content, scale, and origin require verification after download.