Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,560 datasets
PKU-Alignment processed the HH-RLHF dataset into an easy-to-use conversational and human-preference form. The dataset was last updated on November 24, 2023. Its specific scale and column structure are not detailed in the provided metadata.
M3Docvqa is a multimodal dataset published on Hugging Face by YeMoKoo on May 3, 2025. The dataset likely contains document images paired with questions and answers. Its specific size, format, and content require verification after download.
Sealvqa Gqa is a dataset hosted on Hugging Face by the author dddraxxx, last updated on August 13, 2025. Its title suggests it is related to visual question answering, likely containing image-question-answer pairs. The specific content, scale, and collection methodology require verification after download.
Multimodal knowledge graph completion data featuring text and image modalities for link prediction and relation extraction. Released by zjunlp for the SIGIR 2022 conference, it supports the training of hybrid transformer models for knowledge graph enrichment.
A test set for the OK-VQA (Outside Knowledge Visual Question Answering) benchmark, created by Multimodal-Fatima and uploaded to Hugging Face on 2023-05-29. The dataset is designed for evaluating models that answer questions about images using external world knowledge. Specific details on size, columns, and license are not provided in the metadata.
Anthropic's HH dataset reformatted into prompt, chosen, and rejected samples by Dahoas. The data was last updated on Hugging Face in February 2023. It provides a structured format for training and evaluating language models using human preferences.
ChemVQA Text is a dataset published on Hugging Face by author chandrabhuma. The title suggests it likely contains chemistry-related content for visual question answering tasks. The dataset was last updated on October 28, 2025.
Llava Recap Cc12M is a multimodal dataset created by lmms-lab and published on Hugging Face on October 10, 2024. The title suggests it likely contains image-text pairs for instruction-following tasks. The dataset's specific content, size, and structure require verification after download.
Coderonion curated this repository of public projects and datasets focusing on Large Language Models (LLMs) and AI Generated Content (AIGC), last updated in August 2025. It aggregates links to specialized domains including Vision Language Action (VLA), AI for Science (AI4S), and specific models like DeepSeek and Qwen3.
Seven visual reasoning tasks built from geometric primitives, designed to test the fundamental perception of Vision-Language Models. The dataset includes categories such as line intersections, circle overlaps, and nested shapes, where models frequently fail even though humans solve them easily.
M3It provides between 1 million and 10 million bilingual instruction records for vision-language models, released by MMInstruction in 2023. It covers image classification and image-to-text tasks in both English and Chinese.
Developed by the MIT Visualization Group (mitvis) and updated in 2025, VisText is a benchmark dataset for chart captioning. It provides paired chart images and captions to evaluate how models interpret visual data representations.
Psychology RLHF data was used to train a LLaMA-7B reward model. The dataset was uploaded by author 'samhog' to Hugging Face on July 17, 2023. Its specific content, size, and structure are not detailed in the provided metadata.
8,000 verified multimodal examples for instruction tuning and vision-language tasks, created by the LMMS-Lab. The dataset was last updated in January 2025 and is hosted on Hugging Face.
Egotextvqa is a multimodal dataset for video question answering tasks. The dataset was created by ShengZhou97 and was last updated in April 2025. It contains video and text data, focusing on reasoning tasks that require understanding both visual and language information.
Simple Image Captions provides a collection of image-text pairs for multimodal tasks. The dataset contains at least 1,000 entries, as indicated by its size category, and was uploaded by user 'uygarkurt' to Hugging Face in August 2025.
Lmsys Arena Human Preference 55K Sharegpt is a dataset published on Hugging Face by mlabonne and last updated on October 18, 2024. The title suggests it contains 55,000 records of human preference judgments, likely sourced from the LMSys Chatbot Arena or ShareGPT platforms. The dataset's specific content and structure require verification after download.
13,003 images of 11,003 identities accompanied by 80,440 natural language descriptions. The dataset facilitates cross-modal person search by linking visual pedestrian data from surveillance cameras with detailed textual attributes.
A multimodal dataset hosted on Hugging Face, authored by med-vlrm and last updated on 2025-06-29. The title suggests it involves medical visual question answering (VQA) using a vision-language model (VLM) on PubMed Central (PMC) images, with reasoning processes from GPT-4o. The dataset's specific size, structure, and content are not detailed in the provided metadata.
MedTrinity-25M consists of 25 million multimodal medical records featuring multigranular annotations, developed by UCSC-VLAA for ICLR 2025. The dataset provides large-scale image-text pairings designed to advance the training and evaluation of medical Multimodal Large Language Models (MLLMs).