Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
PRIMUS is a pioneering collection of open-source datasets for cybersecurity LLM training. The Primus-Reasoning subset contains multiple cybersecurity reasoning tasks sourced from CTI-Bench, including CTI-RCM, CTI-VSP, CTI-ATE, and CTI-MCQ. It was augmented in June 2025 with distilled samples from DeepSeek-R1, incorporating intermediate reasoning steps and final answers.
Over 1,700,000 Visual Question Answering samples derived from figures and charts in biomedical articles from PubMed Central. This preliminary release by DermaVLM was last updated on Hugging Face in August 2025. A full dataset card and accompanying research paper are currently in preparation.
3,192 image-annotation pairs form the CalliBench dataset for evaluating Vision Language Models on Chinese calligraphy. The dataset, created by author gtang666, includes tasks for full-page recognition and contextual visual question answering. It was last updated on Hugging Face in July 2025.
OpenGVLab's OmniCorpus-YT is a large-scale multimodal dataset containing 10 million image-text interleaved documents collected from YouTube videos. The dataset is part of the broader OmniCorpus project, which encompasses billions of images, and was presented in an ICLR 2025 Spotlight paper. The repository was last updated on March 20, 2025.
EditScore provides a series of open-source reward models ranging from 7B to 72B parameters for evaluating instruction-guided image editing. The benchmark likely contains data used to train and evaluate these models, with the largest model reportedly surpassing GPT-5 on their internal benchmark. The dataset was last updated on October 17, 2025.
Vript is a fine-grained video-text dataset constructed by Mutonix, containing 12,000 annotated high-resolution videos split into approximately 400,000 clips. The annotation is inspired by video scripts, detailing scene content, shot types, and camera movements. The dataset was last updated on June 11, 2024.
BLIP3-OCR-200M is a dataset designed to improve Vision-Language Models' ability to process text within images. It was created by Salesforce and was last updated on February 3, 2025. The dataset likely contains images integrated with Optical Character Recognition (OCR) data to address limitations in interpreting documents and charts.
MINT-1T contains 1 trillion text tokens and 3.4 billion images, scaling open-source multimodal data by a factor of ten. The dataset was created by a team from the University of Washington and released in 2024, incorporating sources like PDFs and arXiv papers to facilitate research in multimodal pretraining.
Formosa Vision is an open-source visual language dataset focused on Taiwanese local culture, containing over two thousand images selected from the National Cultural Memory Bank 2.0. The dataset was created by the Twinkle AI community using a hybrid method where visual language models generated image dialogues, which were then manually checked and revised by participants. It was last updated on November 20, 2025.
A collection of filtered image-text pairs from academic resources, used for pre-training the Open-Qwen2VL multimodal large language model. The dataset includes subsets like ccs_ebdataset, derived from CC3M-CC12M-SBU and filtered by CLIP, and datacomp_medium_dfn_webdataset. It was created by weizhiwang and last updated on April 16, 2025.
This dataset of 1.6 million pathology image-text pairs was created by jamessyx and last updated in April 2025. It is intended for training Vision Language Models (VLMs) like CLIP and is designed to support pathology applications such as zero-shot image classification and Whole Slide Image analysis.
KIE-HVQA is a dataset supporting research on mitigating Optical Character Recognition hallucinations in multimodal large language models. The dataset was created by bytedance-research and is associated with a paper accepted by the NeurIPS 2025 Main Conference. The data likely contains multimodal document samples for evaluating and improving OCR integration in vision-language models.
491 images from the CountBench benchmark evaluate object counting in vision-language models. The dataset was automatically curated and manually verified from the LAION-400M dataset, introduced by author vikhyatk for the PaliGemma model.
TIGER-Lab's OmniEdit Filtered 1.2M dataset, last updated December 2024, is designed for training a general-purpose image editing model. The dataset was created by filtering data using large multimodal models like GPT-4o for quality assessment. It provides supervision for seven distinct image editing tasks.
GenAI-Bench is a benchmark for evaluating multimodal large language models' ability to judge the quality of AI-generated content. The dataset is based on human preference data collected via the GenAI Arena platform and is maintained by TIGER-Lab. It was last updated on September 8, 2024.
Droid 1.0.1 contains 95,617 robotic episodes and 27,618,651 frames collected using Franka robots. Created by lerobot and updated in July 2025, it documents 49,611 distinct tasks at 15 FPS.
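To put the Droid 1.0.1 figures in perspective, the short Python sketch below derives the total recording time and the mean episode length from the counts quoted above. It assumes every frame was captured at a constant 15 FPS and that the frame count refers to timesteps rather than per-camera images; neither assumption is confirmed by the entry, and the function name `dataset_stats` is purely illustrative.

```python
# Figures quoted in the Droid 1.0.1 entry above.
EPISODES = 95_617
FRAMES = 27_618_651
FPS = 15  # assumed constant across all episodes


def dataset_stats(episodes: int, frames: int, fps: int) -> dict:
    """Derive rough duration statistics from episode and frame counts."""
    total_seconds = frames / fps
    mean_frames = frames / episodes
    return {
        "total_hours": total_seconds / 3600,
        "mean_frames_per_episode": mean_frames,
        "mean_seconds_per_episode": mean_frames / fps,
    }


stats = dataset_stats(EPISODES, FRAMES, FPS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the dataset amounts to roughly 511 hours of recording, with an average episode lasting about 19 seconds.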
ScreenSpot provides over 1,200 text instructions paired with screens from iOS, Android, macOS, Windows, and web environments for evaluating GUI grounding. Researchers from Nanjing University and the Shanghai AI Laboratory created this benchmark to test large multimodal models. The dataset was last updated in April 2024.
DLC-Bench is a dataset for benchmarking detailed and localized image and video captioning. It was created by researchers from NVIDIA, UC Berkeley, and UCSF, including Long Lian, Yifan Ding, and others. The dataset was last updated on the Hugging Face platform on April 24, 2025.
Magma is a foundation model for multimodal AI agents, developed by researchers from Microsoft Research, University of Maryland, University of Wisconsin-Madison, KAIST, and University of Washington. The associated dataset, Magma-AITW-SoM, likely serves as a benchmark for evaluating multimodal agent capabilities. The dataset page was last updated on April 29, 2025.
1,981,157 synthetically generated chart images with ground truth annotations form this multimodal dataset. Created by the docling-project and last updated in July 2025, it is designed for training the SmolDocling model on chart-based document understanding. Charts were rendered at 120 DPI using visualization libraries like Matplotlib, Seaborn, and Pyecharts.