Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
Synthetic Visual Genome (SVG) datasets are designed for training Vision-Language Models on scene graph understanding and dense visual relationships. The datasets were created by author jamepark3922 and were last updated on June 11, 2025. They are hosted on the Hugging Face platform.
EndoVQA-Instruct is a multi-modal dataset containing endoscopy images and associated text, designed for benchmarking multi-modal large language models in medical analysis. The dataset includes images from the in-house WCE2025 collection and is managed by author Saint-lsy. Access to the data is restricted and requires formal request and approval.
A benchmark for evaluating embodied spatial understanding in Large Vision-Language Models, created by Phineas476 and last updated on June 23, 2024. It comprises 3,640 question-answer pairs automatically derived from embodied scenes, covering 294 object categories and 6 spatial relationships from an egocentric perspective. The associated EmbSpatial-SFT dataset provides instruction-tuning data for spatial tasks.
InfinityMATH is a scalable instruction tuning dataset for programmatic mathematical reasoning. The dataset was created by BAAI and was last updated on September 3, 2024. Its construction pipeline emphasizes decoupling numbers from problems to synthesize number-independent programs.
DARE (Diverse Visual Question Answering with Robustness Evaluation) is a multiple-choice VQA benchmark created by cambridgeltl. It evaluates Vision-Language Model performance across five diverse categories and includes four robustness-oriented evaluations based on variations in prompts, answer options, output format, and the number of correct answers. The validation split contains images, questions, answer options, and correct answers.
RLHF-V-Dataset is a large-scale multimodal feedback dataset constructed using open-source models for reinforcement learning. It was released by the openbmb organization in May 2024 and has been utilized in models like MiniCPM-V 2.0. The dataset is designed for diverse tasks involving computer vision and large language models.
7 million diverse images sourced from datasets like COYO-700M and MS-COCO 2017, each paired with both a short and a detailed caption. This re-captioned dataset was created by DAMO-NLP-SG for training the VideoLLaMA 3 multimodal foundation model and was last updated in February 2025.
Aggregating 10,000 to 100,000 medical image-text pairs, this 2024 release from FreedomIntelligence serves as a standardized evaluation suite for multimodal LLMs. It incorporates six distinct benchmarks including VQA-RAD, SLAKE, and PathVQA to test models like HuatuoGPT-Vision.
ShowUI-desktop-8K consists of approximately 8,000 PC-based UI grounding records featuring screenshots and annotations originally sourced from OmniAct. Created by showlab and updated in March 2025, the dataset provides visual and textual data for desktop interface interaction research. It utilizes GPT-4o to augment original labels with detailed attributes regarding appearance and functionality.
DocVQA consists of 10,000 to 100,000 document images paired with question-answer sets, formatted by lmms-lab in 2024. This version is derived from the original 2020 DocVQA research to facilitate standardized evaluation of Large Multi-modality Models (LMMs). It provides a structured framework for testing how models interpret text and layout within diverse document types.
MemoryBench provides benchmark tasks across spatial memory and action recall categories for robotic manipulation. It serves as the evaluation foundation for the SAM2Act+ framework, focusing on the integration of visual foundation models with memory architectures.
12 million unique identifiers (UIDs) reference a filtered subset of the larger DataComp-1B-BestPool dataset. Apple created this collection to train image-text models that outperform those trained on established datasets like CC-12M and YFCC-15M. The dataset card was last updated in February 2025.
A multimodal dataset hosted on Hugging Face, created by 5CD-AI and last updated on November 27, 2024. Its description suggests that it contains visual reasoning examples in which models are instructed to explain their reasoning step by step before giving a final answer; the provided example involves counting straws and cups.
A benchmark for evaluating multimodal embedding models, covering 4 meta tasks and 36 datasets. The dataset was created by TIGER-Lab and published in the paper 'VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks'. It was last updated on Hugging Face on October 28, 2024.
ShareGPT4Video provides 4.8 million multi-modal video captions generated via GPT-4-Vision to improve modality alignment in Large Video-Language Models. Developed by the ShareGPT4Video team in 2024, the collection includes a specific 40,000-record subset for fine-grained visual perception tasks.
Filtered WIT is an image-text dataset derived from the Wikipedia Image Text (WIT) dataset, containing 10,000 samples per archived tar file. Each sample includes a .jpg image, a .txt caption, and a .json metadata file. The dataset is provided by LAION and was last updated in January 2022.
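The per-sample layout described above (a .jpg, a .txt, and a .json file sharing one key inside a tar archive) can be read with the Python standard library alone. The following is a minimal sketch, not the dataset's official loader: the function name `iter_samples` and the output keys are hypothetical, and it assumes that the files of one sample share the same name up to the extension.

```python
import io
import json
import tarfile
from collections import defaultdict


def iter_samples(fileobj):
    """Yield one dict per sample from a tar archive of .jpg/.txt/.json triples.

    Assumption: the three files of a sample share the same name up to the
    final extension (for example 0001.jpg, 0001.txt, 0001.json).
    """
    groups = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split "path/0001.jpg" into key "path/0001" and extension "jpg".
            key, _, ext = member.name.rpartition(".")
            groups[key][ext] = tar.extractfile(member).read()

    for key, parts in groups.items():
        # Skip incomplete samples instead of guessing at missing parts.
        if {"jpg", "txt", "json"} <= parts.keys():
            yield {
                "key": key,
                "image": parts["jpg"],  # raw JPEG bytes, left undecoded
                "caption": parts["txt"].decode("utf-8"),
                "meta": json.loads(parts["json"]),
            }
```

Opening a downloaded archive in binary mode and passing the file object to `iter_samples` would yield the samples one by one; the image bytes are left undecoded so that no imaging library is required.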
PuzzleVQA is a dataset created by declare-lab for evaluating large multimodal models. The dataset likely contains puzzles based on abstract patterns to test general intelligence and reasoning abilities. It was last updated on Hugging Face on February 26, 2025.
Atsunori converted the NVIDIA HelpSteer2 dataset into preference pairs for training Direct Preference Optimization models. The conversion is based on the helpfulness score of responses, with the higher-scoring response designated as the chosen one. The dataset was last updated on July 11, 2024.
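The conversion rule described above (the response with the higher helpfulness score becomes the chosen one) can be sketched as follows. This is an illustration under stated assumptions, not the author's actual script: the row keys `prompt`, `response`, and `helpfulness`, the function name `build_preference_pairs`, and the decision to skip tied scores are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_preference_pairs(rows):
    """Turn scored responses into chosen/rejected pairs, grouped by prompt.

    Assumption: each row is a dict with "prompt", "response", and a numeric
    "helpfulness" score.
    """
    by_prompt = defaultdict(list)
    for row in rows:
        by_prompt[row["prompt"]].append(row)

    pairs = []
    for prompt, responses in by_prompt.items():
        for a, b in combinations(responses, 2):
            if a["helpfulness"] == b["helpfulness"]:
                # Assumption: a tie carries no preference signal, so skip it.
                continue
            chosen, rejected = (a, b) if a["helpfulness"] > b["helpfulness"] else (b, a)
            pairs.append(
                {
                    "prompt": prompt,
                    "chosen": chosen["response"],
                    "rejected": rejected["response"],
                }
            )
    return pairs
```

The resulting prompt/chosen/rejected records follow the shape commonly used for Direct Preference Optimization training data.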
Hindi VQA is a dataset for visual question answering in Hindi. It was filtered for better balance; sentence embeddings were then computed with a pre-trained transformer model, clustered with KMeans, and projected with t-SNE for visualization. The dataset was uploaded to Hugging Face by damerajee on June 2, 2024.
Supervised fine-tuning pairs built from rejected responses in the Anthropic HH-RLHF dataset. Each example provides a multi-turn conversation history formatted with Human/Assistant turns and the subsequent rejected assistant turn.
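One way to derive such a pair from a single rejected transcript is to split it at the final assistant turn. The sketch below assumes the transcript uses the "\n\nHuman:" and "\n\nAssistant:" turn markers of the HH-RLHF release; the function name `split_rejected_transcript` and the output keys `prompt` and `completion` are hypothetical and may not match the dataset's actual columns.

```python
ASSISTANT_MARKER = "\n\nAssistant:"


def split_rejected_transcript(transcript):
    """Split a Human/Assistant transcript at its final assistant turn.

    Returns the conversation history (ending with the assistant marker) as the
    prompt and the last assistant turn as the completion.
    """
    history, marker, last_turn = transcript.rpartition(ASSISTANT_MARKER)
    if not marker:
        raise ValueError("transcript contains no Assistant turn")
    return {
        "prompt": history + ASSISTANT_MARKER,
        "completion": last_turn.strip(),
    }
```

Using `rpartition` keeps every earlier assistant turn inside the history, so multi-turn conversations are preserved intact.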