Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
LLaVA-OneVision-1.5-Instruct is an instruction dataset of 22 million samples curated by MVP-Lab for training large multimodal models. It was developed to support the LLaVA-OneVision-1.5 model family and was last updated in November 2025.
DAD-3DHeads provides dense 3D annotations for head alignment and reconstruction from single images, published by PinataFarms for CVPR 2022. The data includes FLAME model parameters and 3D landmark coordinates for 3D Morphable Model (3DMM) fitting. It was developed to address the lack of diverse head poses in existing 2D landmark datasets.
Mizzen AI, CUHK MMLab, and academic partners released the Human Preference Dataset v3 (HPDv3) in August 2025. It comprises 1.08 million text-image pairs and 1.17 million annotated pairwise comparisons for modeling human preferences. The dataset is associated with the ICCV 2025 paper 'HPSv3: Towards Wide-Spectrum Human Preference Score'.
AllenAI provides a dataset for visual question answering tasks. It contains image-text pairs designed for evaluating multimodal language models. The dataset was updated in January 2026.
NVIDIA's Nemotron Content Safety Dataset V2 contains 33,416 annotated interactions between humans and LLMs, released in June 2025. It provides structured training, validation, and test splits curated from human preference data to support safety alignment and toxicity detection.
16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios form this benchmark for evaluating vision-centric multimodal retrieval-augmented generation (RAG) abilities in Large Vision Language Models (LVLMs). The dataset, named MRAG-Bench, was created by uclanlp and last updated on November 5, 2024. It provides a systematic evaluation framework for both open-source and proprietary models.
This Gemini 3 Pro benchmark dataset for multimodal evaluation was created by AliMertTemizsoy and published on Hugging Face in January 2026. It contains image-text pairs for visual question answering tasks.
265,016 images from MS COCO are paired with 1,105,904 questions and 11,059,040 ground-truth answers (10 per question). To reduce language bias, the dataset is balanced: each question is associated with a pair of similar images that yield different answers.
140,000 user votes, each comparing the responses of two language models in a conversation, were collected via the LM Arena platform. Each row contains a single vote with the full conversation history and metadata such as the winning model and the evaluation session. The dataset was created by lmarena-ai and last updated in August 2025.
Meta released the PE Video Dataset (PVD) in April 2025, featuring 1 million high-quality videos for perception encoding research. The collection includes 120,000 clips with human-verified annotations, while the full set is accompanied by descriptions and keywords.
MCD-rPPG is a large-scale multimodal dataset for remote photoplethysmography and health biomarker estimation from video. The dataset contains synchronized video recordings from multiple camera views, designed for the paper 'Gaze into the Heart: A Multi-View Video Dataset for rPPG and Health Biomarkers Estimation'.
Over 10 million image-text pairs constitute this global-scale remote sensing dataset, which also includes geographical location and resolution information. The dataset was authored by 'lcybuaa' and was last updated on the Hugging Face platform in June 2025.
Over 1 million curated image-caption pairs were released by the Frontier Research Team at takara.ai in February 2025. The collection was produced by consolidating and standardizing multiple open-source datasets through a 96-hour computational validation process across three nodes.
SToCorpus-88M is a pre-training dataset used for the SToFM multi-scale foundation model for spatial transcriptomics. The dataset is associated with a research paper and model code published on GitHub.
Salesforce developed UniDoc-Bench in 2024 as a benchmark for multimodal retrieval-augmented generation (MM-RAG). It contains 1,700+ multimodal QA pairs derived from a corpus of 70,000 real-world PDF pages across eight domains. The data links evidence across text, tables, and figures to support complex document-based reasoning tasks.
Multimodal-Mind2Web aligns HTML documents from the Mind2Web dataset with their corresponding webpage screenshot images. The dataset was created by osunlp to address the inconvenience of loading images from the original ~300GB raw dump and was last updated on June 5, 2024.
ConstructionSite 10k contains 10,013 construction site images and annotations released by LouisChen15 in October 2025. The collection is partitioned into 7,009 training and 3,004 test samples specifically designed to evaluate Vision Language Models (VLMs) in civil engineering contexts.
MindCube is a benchmark for evaluating Vision Language Models' ability to form spatial mental models from limited visual information. It contains 21,154 questions across 3,268 images, created by MLL-Lab. The dataset was last updated in November 2025.
FashionRec is a multimodal dataset with 331,124 samples designed to train Vision-Language Models for fashion recommendation. It was created by Anony100 and last updated on October 14, 2025. The dataset integrates human-curated outfits with dialogue data sourced from three fashion datasets: iFashion, Polyvore-519, and Fashion32.
This dataset encompasses 846,113 text samples for Amharic language model training, split into 761,501 training and 84,612 test samples. It was created by YoseAli and last updated in August 2025.