Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
MINT-1T is an open-source multimodal interleaved dataset designed for pretraining research. It contains one trillion text tokens and 3.4 billion images, representing a 10x scale-up from prior open-source collections and includes sources like PDFs and arXiv papers. The dataset was created by a team from the University of Washington and was last updated on the platform in September 2024.
Synthetic annotations for images and documents created by Facebook for the PLM model. The dataset includes generated captions for images from SA1B, OpenImages, and Object365, and question-answer pairs for documents from ArXivQA, UCSF, and PDFAcc. The dataset was last updated on April 21, 2025.
PVIT-3M is a dataset of 3 million image-text pairs designed for tuning Multimodal Large Language Models (MLLMs) on personalized visual instruction tasks. It was created by Sterzhang and introduced in the paper "Personalized Visual Instruction Tuning". The dataset was last updated on November 2, 2024.
CameraBench is a dataset for understanding camera motions in videos, created by syCen. It includes approximately 1,400 annotated video clips used for fine-tuning models like Qwen2.5-VL. The dataset supports evaluation of Structure-from-Motion, Visual Language Models, and scene-aware semantic analysis.
MedXpertQA is a benchmark dataset containing 4,460 questions for evaluating expert-level medical knowledge and reasoning. It was created by TsinghuaC3I and features both text-based and multimodal tasks that integrate structured clinical data with images. The dataset was last updated in July 2025.
A dataset named 'Llava Onevision 1.5 Rl Data' published on the Hugging Face platform by author mvp-lab. The dataset was last updated on 2026-01-06. Platform tags indicate it contains both image and text modalities, suggesting it is likely a multimodal dataset for training or fine-tuning vision-language models.
73 newsbites from eight major European newspapers published in the three days following the January 8, 2023, attack on Brazil's federal government buildings. Isabel Alonso Belmonte collected this multilingual sample to explore the multimodal construction of the political event. The dataset was last updated on October 14, 2025.
A small dataset of synthetic text captions describing food and non-food images. The text captions were generated using the Mistral Chat and Mixtral language models. It was created by author mrdbourke and last updated on June 7, 2024.
A collection of 27 million images, each paired with a long caption generated by the Qwen2.5-VL-7B-Instruct model. The dataset was created by the BLIP3o organization and published on Hugging Face in June 2025. It is intended for pretraining vision-language models.
RLAIF-V provides between 10,000 and 100,000 multimodal preference-alignment records developed by OpenBMB to improve Multimodal Large Language Model (MLLM) trustworthiness. The data utilizes AI-generated feedback to refine model responses, serving as a core training component for the MiniCPM-V 4.5 model released in 2024.
Published on Hugging Face by author mm-eval and last updated on 2026-01-12 07:15:59. The dataset's title suggests it is a toolkit for evaluating vision-language models. Its specific content, scale, and data types require verification after download.
SFT Datasets of ChemVLM, part of the MMChem Series. The dataset includes both image bytes and conversations for training multimodal large language models in chemistry. It was last updated on October 29, 2025, by the author di-zhang-fdu.
MIRIAD contains 4.4 million medical question-answer pairs. The pairs were distilled from peer-reviewed biomedical literature using large language models, providing structured data for downstream tasks.
HH_length_biased_15k is a 15,000-sample subset of Anthropic/hh-rlhf, created for the paper 'Understanding impacts of human feedback via influence functions'. Taywon Min authored this dataset, which was last updated on December 5, 2024. It contains 976 samples where responses were intentionally flipped to be lengthy.
SPA-VL contains 100,788 samples across 6 harmfulness domains and 53 subcategories, released by researcher sqrti in mid-2024. The dataset facilitates safety preference alignment for Vision Language Models (VLMs) using multimodal image-text pairs.
Over 5 billion tokens of Traditional Chinese Medicine text form the largest existing TCM corpus, sourced from websites and books. FreedomIntelligence released this multimodal dataset for pre-training the ShizhenGPT model. It was last updated in September 2025.
ShizhenGPT's pre-training dataset contains over 5 billion tokens of Traditional Chinese Medicine text from websites and books, along with a large-scale image-text dataset. The dataset was created by FreedomIntelligence and was last updated in September 2025.
400 full-length high-definition talking face videos, split into 81-frame clips and paired with audio embeddings. The dataset was curated by global-optima-research and last updated on June 4, 2025. It is intended for tasks in talking-head generation and multimodal avatar synthesis.
Synthetic annotations for video understanding tasks, covering the YT-1B and Ego4d datasets. The dataset includes video captions and multiple-choice question-answer pairs, as described in the associated technical report. It was created by Facebook and last updated on the Hugging Face platform in April 2025.
CoralVQA contains 12,805 real-world coral images from 67 genera, drawn from 3 oceans and paired with 277,653 question-answer pairs assessing ecological and health conditions. The dataset was created by CoralReefData and last updated on September 29.