Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
CyberSecEval 3 Visual Prompt Injection is a multimodal benchmark from Meta for evaluating cybersecurity risks in LLMs. It contains text and image inputs designed to test visual prompt injection vulnerabilities. The dataset is part of a larger security benchmark suite and was last updated in March 2025.
GIFT-Eval Pre-training Datasets contain 4.5 million univariate and multivariate time series totaling 230 billion data points, spanning seven domains and 13 frequencies. The collection, created by Salesforce, is designed for pretraining foundation models and is explicitly aligned with the GIFT-Eval benchmark to avoid data leakage between training and testing splits.
15,110 high-quality synthetic identity documents designed for fine-tuning Vision Language Models. The dataset includes realistic driver's licenses and credit cards with diverse variations in design, layout, and content, created by sugiv. It was last updated on July 20, 2025.
A collection of multimodal histological images from the Tumor Profiler Study. It includes whole-slide H&E images, multiplexed immunofluorescence images from Ultivue panels, alignment matrices, exclusion masks, and nuclear segmentation outputs for 10 cancer samples. The dataset was authored by CTPLab-DBE-UniBas and last updated on HuggingFace in June 2025.
Scientific Openly-Licensed Publications (SciOL) and its companion dataset, MuLMS-Img, are introduced in a WACV 2024 paper by Tim Tarsi et al. The dataset is designed for image-text tasks within the scientific domain and is hosted on HuggingFace by the author Timbrt. The dataset page was last updated on April 17, 2024.
A collection of multimodal mathematics problems and reasoning chains presented in the URSA research paper. The dataset was created by the URSA-MATH organization and was last updated on February 18, 2025. It likely contains over one million examples integrating visual and textual data for training and evaluating AI models.
A Multi-Choice Visual Question Answering dataset designed to evaluate Vision-Language Models on their understanding of Korean culture. It was created through a Human-VLM collaboration and is part of research presented in a June 2024 arXiv paper. The dataset was last updated on HuggingFace on August 17, 2024.
A multimodal dataset used in the 'Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning' project. The dataset was created by author 'tanhuajie2001' and was last updated on the Hugging Face platform on April 18, 2025. Its description suggests it is intended to enhance embodied reasoning capabilities for systems like RoboBrain.
205k high-quality samples for aligning Multimodal Large Language Models with human preferences. The dataset was created by PhoenixZ and is associated with the paper 'OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference'. It was last updated on March 1, 2025.
2,973 pages of ancient Chinese documents form a benchmark for evaluating multimodal large models. The dataset, created by ByteDance, is designed for tasks ranging from optical character recognition to knowledge reasoning. It was last updated on the platform in September 2025.
LLaVA-NeXT Data contains between 100,000 and 1,000,000 instruction-tuning pairs for multimodal large language models, released by lmms-lab in August 2024. It provides the specific data mixtures used to train the LLaVA-NeXT and LLaVA-NeXT (stronger) models, featuring synchronized image and text instruction sets.
56,989 images depicting quintessentially Vietnamese scenes, annotated for Visual Question Answering (VQA). The dataset includes landscapes, historical sites, culinary specialties, festivals, and everyday life from various regions. It was created by 5CD-AI and last updated on Hugging Face in August 2024.
Over 55,000 real-world user and LLM conversations with associated user preferences, collected from battles between over 70 state-of-the-art LLMs. It was created for a Kaggle competition to predict human preferences in chatbot responses.
The OmniAlign-V-DPO dataset contains 150,000 high-quality positive-negative pairs for Direct Preference Optimization (DPO). It is based on the OmniAlign-V dataset and was created by PhoenixZ. The dataset was last updated on March 1, 2025.
A dataset for visual question answering based on figures extracted from arXiv publications. It originates from the ArXiVQA dataset within the Multimodal ArXiv collection. The dataset was created by openbmb and was last updated on March 15, 2025.
10,000 spatial reasoning samples designed for geometric imagination from limited 2D visual perspectives. The dataset facilitates 3D mental modeling during reasoning tasks without the need for explicit 3D prior inputs or depth data.
DreamLIP-Long-Captions contains approximately 30 million image annotations consisting of detailed long captions. The captions were generated using pre-trained Multi-modality Large Language Models, with an average length of 247 characters.
Rapidata's Flux SD3 MJ Dalle Human Alignment Dataset is one of three splits from a larger collection of over 2 million human annotations for image generation models. This specific subset focuses on text-to-image alignment, while the other splits cover preference and coherence judgments. The dataset was last updated on Hugging Face in January 2025.
Document Haystack is a benchmark dataset for evaluating multimodal Large Language Models on long-context image and document understanding tasks. It was created by AmazonScience for a 2025 research paper to address the lack of suitable benchmarks for processing long documents.
DenseFusion-1M provides 1 million image-text pairs for multi-modal perception, released by the Beijing Academy of Artificial Intelligence (BAAI) in 2024. The dataset uses a Perceptual Fusion approach to combine outputs from specialized vision experts and GPT-4V into detailed descriptions.