Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,560 datasets
A multimodal dataset from Hugging Face, authored by med-vlrm and last updated on 2025-06-29. The title suggests it involves medical visual question answering (VQA) using a vision-language model (VLM) on PubMed Central (PMC) images, with reasoning processes from GPT-4o. The dataset's specific size, structure, and content are not detailed in the provided metadata.
Dynvqa is a multimodal dataset hosted on Hugging Face, authored by xandery and last updated on 2025-08-24. The dataset likely contains image-text pairs for question answering tasks, as suggested by its platform tags. The specific number of samples, column structure, and data collection methodology are not detailed in the available metadata.
A multimodal dataset from Hugging Face, created by med-vlrm and last updated on 2025-06-29. The platform tags suggest it contains medical vision-language data, likely involving images and text processed with GPT-4o for reasoning tasks. The specific content, scale, and structure require verification after download.
Llava 3D Data is a multimodal dataset published on HuggingFace by author ChaimZhu. The dataset was last updated on July 11, 2025. Its specific content and scale are not detailed in the available metadata.
A multimodal dataset likely designed for instruction-tuning of Vision-Language Models (VLMs). The dataset was published on HuggingFace by Hirai-Labs and was last updated on April 4, 2025. Its specific content and scale are not detailed in the available metadata.
OK-VQA_train is a dataset for visual question answering tasks, likely containing image-question-answer pairs. The dataset was created by Multimodal-Fatima and was last updated on March 23, 2023. Specific details on the number of samples, data format, and license are not provided in the available metadata.
558,000 image-text pairs form this dataset for vision-language instruction tuning, curated by the lmms-lab research group. It was last updated in May 2024 and is hosted on Hugging Face. The data is specifically designed for training and evaluating multimodal AI models that process both visual and textual information.
The Wikipedia-based Image Text (WIT) Dataset contains 37.6 million entity-rich image-text examples paired with 11.5 million unique images across 108 Wikipedia languages. It was created by keshan for pretraining multimodal machine learning models and was last updated in August 2021.
A blind evaluation dataset of high-quality, diverse, human-written instructions with demonstrations. The dataset was created by HuggingFaceH4 and last updated on February 28, 2023. It is intended for use in step 3 evaluations within a Reinforcement Learning from Human Feedback pipeline.
Japanese Hh Rlhf 49K is a dataset derived from kunishou/hh-rlhf-49k-ja, excluding examples where ng_translation equals 1. The dataset was authored by fn-aka-mur and last updated on Hugging Face in May 2023. Its specific size and row count are not detailed in the provided metadata.
The Public Multimodal Dataset (PMD) contains 70 million image-text pairs with 68 million unique images. It was introduced in the FLAVA paper and aggregated from publicly-available sources including Conceptual Captions, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome, and a subset of YFCC100M.
Cvqa is a dataset uploaded to Hugging Face by author 'davidanugraha'. The dataset was last updated on June 30, 2025. Its specific content, size, and structure are not described in the available metadata.
A multimodal dataset likely containing information related to chemical entities, as suggested by the title 'Chebi'. The dataset was published on Hugging Face by the author jablonkagroup and was last updated on 2025-05-11. The platform tags indicate it contains both image and text modalities.
This dataset annotates public videos of natural disasters with 32 features across 5 categories, such as Environment and Damage. It was used for the TRECVID DSDI task from 2020 to 2022 and is maintained by the National Institute of Standards and Technology. All footage consists of airborne, low-altitude video from disaster events.
Holmes-VAD provides video sequences and textual reasoning labels for explainable anomaly detection, released by pipixin321 in 2025. It serves as the official data source for the Holmes-VAD framework, which integrates Multi-modal Large Language Models with surveillance footage. The dataset is distributed under the MIT license.
StackoverflowVQA is a multimodal dataset likely containing visual question-answering data. The dataset was uploaded by mirzaei2114 and was last updated on Hugging Face on November 29, 2023. The specific content, size, and structure are not detailed in the available metadata.
Created by zjunlp for ICLR 2023, this dataset supports multimodal analogical reasoning over knowledge graphs. It provides a structured environment for the MARS framework, linking visual and textual data to relational graph structures for reasoning tasks.
Gemex Vqa is a dataset published on Hugging Face by BoKelvin on December 1, 2024. The title suggests it is related to visual question answering, a multimodal AI task. Specific details on size, format, and content are not provided in the available metadata.
Latex Vlm is a dataset published on HuggingFace by JosselinSom. The dataset was last updated on January 20, 2024. Its specific content and scale are not detailed in the available metadata.
Sam Llava Captions10M is a dataset published on HuggingFace by PixArt-alpha on January 12, 2024. The title suggests it contains image-caption pairs, likely for vision-language model training. The dataset's scale and specific content require verification after download.