Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
A dataset titled 'Multimodal_csv' is available on Kaggle. The dataset's specific content, size, and origin are not detailed in the provided metadata. Further verification is required to confirm the exact nature and composition of the multimodal elements.
A dataset titled 'ttv_sp_llava_final' is published on Kaggle. The title suggests it is a final version of data related to the LLaVA (Large Language-and-Vision Assistant) model, likely containing multimodal content for vision-language tasks. Metadata is minimal; the specific content, size, and origin require verification after download.
GVLM-Data is a dataset hosted on Kaggle. The dataset's title suggests it is likely related to General Vision-Language Models. Its specific content, size, and origin are not detailed in the available metadata.
MicroVQA++ is a three-stage, large-scale and high-quality microscopy visual question answering corpus derived from biomedical imaging sources. The dataset, created by author ieellee and last updated on 2025-12-14, is designed to address the scarcity of training data for scientific reasoning in microscopy with multimodal large language models.
A dataset for training vision-language models, created by NVIDIA. The dataset page includes a version history with updates from August to September 2025. The dataset was last updated on the platform on 2025-10-22.
1,600 voice samples are paired with 304 voice features and 4 demographic variables for diabetes prediction. The dataset is hosted on Kaggle and includes platform tags suggesting a focus on deep learning applications and synthetic data. Its multimodal nature combines audio signal processing with demographic information.
PixelProse contains 16,896,214 image-caption pairs featuring dense synthetic descriptions generated by Gemini 1.0 Pro Vision. Released in 2024 by researchers at the University of Maryland (tomg-group-umd), the collection provides detailed textual representations for images sourced from CommonPool and CC12M.
Rbyte provides multimodal datasets for spatial intelligence and robotics, released by yaak-ai and updated in February 2026. The collection utilizes MCAP and TensorDict formats to facilitate high-performance spatial computing and integration with PyTorch and Polars.
Penn State ScholarSphere provides 9,363 open access books with page images and bibliographic metadata extracted from MARC21 records. The dataset was curated by author biglam for training and evaluating Vision Language Models on automatic metadata extraction from scholarly monographs. It was last updated on October 16, 2025.
COCO-QA Vietnamese is a fully translated Vietnamese version of the popular COCO-QA dataset for Visual Question Answering (VQA) tasks. It contains 117,684 image-based question-answer pairs translated into Vietnamese, with answers limited to one word. The dataset was created by ThucPD and last updated on June 8, 2025.
UNO-Bench is a unified benchmark for exploring compositional relationships between uni-modal and omni-modal capabilities in AI models. The dataset was created by meituan-longcat and was last updated on December 4, 2025. It is accompanied by released evaluation scripts and a scoring model named UNO-Scorer-Qwen3-14B.
Open-Orca's SlimOrca Dedup is a dataset of 363,000 unique instruction-response examples derived from the SlimOrca collection. It was created by removing RLHF instances and applying minhash and Jaccard similarity techniques for deduplication. The dataset was last updated on Hugging Face on May 19, 2025.
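The minhash and Jaccard deduplication mentioned in the entry above can be illustrated with a minimal sketch. It is not the Open-Orca pipeline: the shingle size, the number of hash functions, the 0.8 threshold, and every function name below are assumptions chosen for illustration.

```python
import hashlib
import re

NUM_HASHES = 64  # assumed signature length; more hashes give a tighter estimate


def shingles(text, k=3):
    """Return the set of lowercase word k-grams for a text."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_hashes=NUM_HASHES):
    """For each salted hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts, threshold=0.8):
    """Keep a text only if it is not a near-duplicate of one already kept."""
    kept = []  # (original index, signature) pairs
    for index, text in enumerate(texts):
        signature = minhash_signature(shingles(text))
        if all(estimated_jaccard(signature, other) < threshold for _, other in kept):
            kept.append((index, signature))
    return [texts[index] for index, _ in kept]
```

This sketch compares each record against every kept record, which is quadratic in the worst case; large-scale pipelines typically add locality-sensitive hashing on top of the signatures so that only likely matches are compared.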
Approximately 100,000 image-caption pairs form this dataset for training image-to-text models. It was created by prithivMLmods and last updated on August 28, 2025. The dataset emphasizes long-form captions covering a wide range of real-world and artistic scenes.
A dataset titled 'Multimodal1' is published on Kaggle. The title suggests it contains multiple data modalities, such as text, images, or audio, likely intended for AI model training. The author, organization, size, and specific content are unknown.
Kaggle hosts a dataset titled 'multimodal', which likely contains data from multiple modalities such as text, images, or audio for machine learning tasks. The dataset's specific content, size, and creator are not detailed in the available metadata. Its last update date and other descriptive details are unknown.
A dataset for Visual Question Answering tasks, likely containing pairs of images and questions with corresponding answers. It is hosted on Kaggle. The specific size, creation date, and authorship are unknown.
MathVision-Wild provides between 1,000 and 10,000 photographed versions of problems from the MathVision test set, captured in diverse physical environments. Created by MathLLMs and updated in late 2025, it moves digital math problems into real-world visual contexts to evaluate Vision Language Model (VLM) performance.
MinishLab released Semhash in January 2026 to provide a framework for fast multimodal semantic deduplication and filtering. The project utilizes model2vec and vicinity-based hashing to identify near-duplicate records across text and image datasets.
HuggingFaceM4 released FineVision in October 2025, a collection of 24.3 million samples featuring 17.3 million images and 88.9 million conversational turns. The dataset is designed for training open Vision-Language Models and contains 9.5 billion answer tokens.
SLAKE contains 642 medical image samples with multi-task annotations, curated by Voxel51 based on research published in 2021 (arXiv:2102.09542). It provides a specialized dataset for medical visual question answering and computer vision, featuring labels for classification, detection, and segmentation tasks.