Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
A subset of the LAION/CC/SBU dataset filtered for more balanced concept coverage distribution, constructed for the pretraining stage of visual instruction tuning. It contains synthetic captions generated by BLIP for reference and aims to build large multimodal models towards GPT-4 vision/language capability. The dataset was created by liuhaotian and last updated in July 2023.
10,000 real-world WebDev Arena battles involving 10 state-of-the-art large language models (LLMs). The dataset was created by lmarena-ai and was last updated on March 10, 2025. It is hosted on the Hugging Face platform.
6.5 million keyframe images are interleaved with 0.8 billion words of ASR text from instructional videos, forming a corpus for vision-language pretraining. The dataset was created by DAMO-NLP-SG for the research project '2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining' and was last updated in March 2025.
MER2023 is a large-scale multimodal emotion recognition dataset created by MERChallenge for a research competition. The dataset is designed to address challenges in real-world deployment, such as costly labeling and modality-related issues. It was last updated on November 8, 2025.
Zenodo10K consists of over 10,000 PowerPoint (.pptx) files crawled from the Zenodo repository by the PPTAgent research team. Released in early 2025, it provides a large-scale collection of presentation documents for document understanding and automated slide generation tasks.
RLAIF-V-Dataset is a large-scale multimodal feedback dataset created by unsloth. It provides 83,132 preference pairs, where instructions are collected from a diverse set of sources. The dataset was last updated on Hugging Face on 2024-09-26.
Between 1,000 and 10,000 AI-generated images from midjourneysref.com comprise this collection of style references and automated captions. Created by peteromallet and updated in July 2025, the records are optimized for machine learning via Parquet storage and smart cropping.
27,519 images and corresponding question-answer pairs translated from the GQA train_balanced and testdev_balanced splits into Russian. The data underwent gpt-4-turbo translation followed by manual validation to correct errors and remove safety-filtered content. It is structured for use within the lmms-eval pipeline to support multimodal model benchmarking.
PointArena is a dataset for probing multimodal grounding through language-guided pointing. It was created by researchers from the University of Washington and the Allen Institute for Artificial Intelligence. The dataset page was last updated on May 17, 2025.
5,040 text-image pairs across 13 safety scenarios including hate speech and illegal activities. The dataset provides a benchmark for evaluating the safety alignment of multimodal large language models. It specifically targets vulnerabilities in vision-language models through adversarial prompts.
BAAI developed ShareRobot, a collection of 51,403 robotic episodes with 30 frames each, to enhance multi-dimensional robotic capabilities. The data includes labels for task planning, object affordance, and end-effector trajectories using 50 distinct prompt templates.
OlympiadBench contains between 1,000 and 10,000 bilingual scientific problems in mathematics and physics, designed for evaluating AGI reasoning. Created by Hothan and published at ACL 2024, the dataset includes both text-based and multimodal questions in English and Chinese.
UniWorld V1 provides between 1,000 and 10,000 image-text pairs sourced from the BLIP3o-60k collection, released by LanguageBind in June 2025. It utilizes Geneval-style annotations to facilitate the training of high-resolution semantic encoders for unified visual understanding and generation.
ARGUS is a framework for calculating hallucination and omission costs in free-form video captions. The dataset, created by tomg-group-umd, provides metrics to quantify the degree of hallucinated and omitted content in video-language model outputs. It was last updated on June 10.
Conceptual Captions (CC3M) contains approximately 3.3 million images annotated with captions. The dataset was created by pixparse, with images and their raw descriptions harvested from the web, specifically from the Alt-text HTML attribute.
SoccerBench contains approximately 10,000 standardized multimodal multiple-choice question-answering pairs across 14 distinct soccer understanding tasks. Developed by Homie0609 and released in 2025, the benchmark utilizes automated pipelines combined with manual verification to evaluate multi-agent systems.
NVIDIA's Cosmos-Reason1 SFT dataset pairs videos with text annotations for embodied reasoning. The annotations support tasks from multiple sources including BridgeDatav2, RoboVQA, Agibot, HoloAssist, and AV. Released on Hugging Face in May 2025, it also includes RoboFail data for benchmarking.
A fine-tuning dataset for the RDT-1B diffusion foundation model, as described in the paper 'RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation'. The dataset was created by robotics-diffusion-transformer and last updated on Hugging Face on 2024-10-13. The associated research paper was published on arXiv in October 2024.
A benchmark for evaluating Large Multimodal Models (LMMs) on cultural context, local sensitivities, and low-resource language support, integrating visual cues. The dataset was created by MBZUAI and was last updated on February 28, 2025. It is associated with a CVPR 2025 publication.
GroundCUA is a large dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers 87 software platforms across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. The dataset was created by Fhrozen and last updated on Hugging Face in November 2025.