Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
Mantis-Instruct contains 721,000 instruction tuning examples across 14 specialized subsets. It is a fully interleaved text-image dataset designed for training multimodal models on skills like co-reference, reasoning, and temporal understanding. The dataset was created by TIGER-Lab for training the Mantis model families.
Multimodal Graph Benchmark datasets support the paper "Multimodal Graph Benchmark". The datasets are hosted by the organization mm-graph-org on Hugging Face. The repository was last updated on 2025-05-20.
PD3M is a subset of the PD12M dataset, containing 3.3 million image-caption pairs filtered for the highest aesthetic scores. PD12M is the largest public domain image-text dataset to date, designed for training foundation models while minimizing copyright concerns. The dataset was created by Spawning and introduces community-driven governance mechanisms via the Source.Plus platform.
MER2024 is a large-scale multimodal dataset released for the MER24 Challenge at IJCAI. It builds upon the MER23 and MRAC23 datasets from ACM Multimedia, expanding data volume and task diversity. The dataset aims to advance robust and practical multimodal emotion recognition.
DrivingVQA contains multiple-choice questions paired with real-world images for the French driving theory exam. The dataset was created by EPFL-DrivingVQA and was last updated in August 2025. It is designed to test knowledge of traffic laws, road signs, and safe driving practices.
MedBookVQA is a multimodal benchmark built from open-access medical textbooks to evaluate general medical AI (GMAI) and multimodal large language models (MLLMs). The dataset was created by slyipae1 and last updated on June 10, 2025. It aims to address the underutilization of structured textbook knowledge for systematic AI evaluation.
PMC-VQA contains 227,000 visual question-answering pairs associated with 149,000 medical images sourced from PubMed Central. Released by RadGenome and updated in July 2024, the collection includes a specialized version focused on noncompound images to facilitate cleaner model training. The dataset is organized into training and testing splits with a dedicated clean test set for benchmarking.
This is the training split for the Massive Multimodal Embedding Benchmark (MMEB), used to train VLM2Vec models as described in an ICLR 2025 paper. It comprises data from 20 out of 36 datasets selected for evaluating multimodal embedding models across 4 meta tasks.
Approximately 38,000 image-text pairs, with 10,000 sourced from LAION and 28,000 from nsfw_detect. Captions were generated by the LLaVA-NeXT model using a prompt to describe attributes of people. The dataset was created by author zxbsmk and last updated on HuggingFace in July 2024.
MotionBench is a benchmark dataset designed to evaluate and improve the fine-grained motion comprehension capabilities of vision-language models. The dataset was created by zai-org and released in January 2025. It aims to guide the development of more capable video understanding models.
Image-text pairs from the MS COCO 2017 dataset, sourced from cocodataset.org. The data is provided in two formats: a dense format with several sentences per image row and a long format with one caption per row, expanding the dataset length by a factor of five.
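The relationship between the two formats can be sketched in a few lines of Python. This is only an illustration: the field names `image_id`, `captions`, and `caption` are hypothetical and not taken from the dataset card, and the sample rows are invented placeholders. It assumes each dense row carries five captions, which is what the stated factor of five implies.

```python
def dense_to_long(rows):
    """Expand rows holding several captions into one row per caption."""
    long_rows = []
    for row in rows:
        for caption in row["captions"]:
            # Repeat the image reference once per caption.
            long_rows.append({"image_id": row["image_id"], "caption": caption})
    return long_rows


# Hypothetical dense rows: one row per image, five captions each.
dense = [
    {"image_id": 1, "captions": ["c1", "c2", "c3", "c4", "c5"]},
    {"image_id": 2, "captions": ["d1", "d2", "d3", "d4", "d5"]},
]

long_format = dense_to_long(dense)
print(len(dense), len(long_format))
```

With five captions per image, the long format has five times as many rows as the dense one, while each row stays a simple image-caption pair.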
2,101 image-text pairs designed for unsupervised post-training of multi-modal large language models. Each entry includes a 'problem' field with a geometric reasoning question and an 'answer' field containing the corresponding solution.
Over 250 million human ratings on more than 2.2 million cartoon captions form a multimodal preference dataset for creative tasks. The dataset accompanies the paper 'Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning'. It was last updated on the platform in September 2024.
OSWorld provides task examples, retrieval documents, and virtual machine snapshots for benchmarking multimodal agents performing open-ended tasks in real computer environments. The dataset was created by xlangai and last updated in October 2024. It supports evaluation on both x86 and arm64 machine architectures using VMware or VirtualBox.
Misraj Structured Data Dump (MSDD) is a large-scale Arabic multimodal dataset created by Misraj. It was extracted and filtered from Common Crawl dumps using a WASM pipeline and uniquely preserves the structural integrity of web content by providing markdown output. The dataset was last updated on September 29, 2025.
SPAR-Bench contains 7,207 manually verified spatial reasoning question-answer pairs across 20 distinct tasks, released by jasonzhango in 2025. The benchmark evaluates vision-language models using single-view, multi-view, and video modalities to test spatial perception and reasoning capabilities.
A curated collection of five established code instruction datasets formatted for LLM training. The datasets, including Magicoder-OSS-Instruct-75K and glaive-code-assistant-v3, have been processed into the LLAMA chat format with markdown for code snippets. It was created by MaLA-LM and last updated in July 2024.
UltraInteract SFT is a large-scale, high-quality alignment dataset designed for complex reasoning tasks. The dataset, created by openbmb, includes preference trees with reasoning chains, multi-turn interaction trajectories, and pairwise data for preference learning. It was last updated on April 5.
Evaluation records generated by VLMEvalKit, reflecting the OpenVLM Leaderboard. The dataset was last updated on 2025-04-08 06:23:26 and is maintained by the author VLMEval. It contains results from evaluating various Vision-Language Models (VLMs) on different benchmarks.
Synthetic Visual Genome (SVG) datasets are designed for training Vision-Language Models on scene graph understanding and dense visual relationships. The datasets were created by author jamepark3922 and were last updated on June 11, 2025. They are hosted on the Hugging Face platform.