Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
Four categories of block diagram images (BD-EnKo, CBD, FC_A, and FC_B) are referenced, though only the BD-EnKo subset is provided for summarization research. It facilitates the study of local-global fusion for visual-textual integration as presented at ACL 2024.
ClaraVid is a synthetic dataset for semantic and geometric neural reconstruction from low-altitude UAV imagery. It contains 16,917 multimodal frames collected across 8 UAV missions over diverse environments. The dataset was created by radubeche and was last updated on October 31, 2025.
Facecaption 1M is a dataset of 1 million facial image-text pairs, as indicated by its title. The dataset was created by authors from OpenFace-CQUPT and published in a 2024 arXiv paper. The dataset listing on HuggingFace was last updated on August 1, 2025.
A blend of publicly available datasets for instruction tuning, including samples from OASST, CodeContests, FLAN, T0, Open_Platypus, and GSM8K. The dataset was created by NVIDIA and last updated on March 9, 2024. It consists of four columns, though specific column names and the total number of rows are not detailed in the provided metadata.
DriveLM-Data comprises two distinct components: DriveLM-nuScenes and DriveLM-CARLA. The dataset is designed to facilitate Perception, Prediction, Planning, Behavior, and Motion tasks with human-written reasoning logic. It was created by OpenDriveLab and was last updated on March 4, 2025.
LVBench is a benchmark for long video understanding featuring videos up to two hours in duration, released by zai-org in June 2024. It contains approximately 1,000 records designed to evaluate multimodal models on visual question answering and multiple-choice tasks. The dataset addresses the challenge of extracting information from extended temporal windows that exceed standard video benchmarks.
SlideVQA is a document visual question answering dataset containing between 10,000 and 100,000 records, released by NTT-hil-insight in 2023. It focuses on multi-image reasoning where models must select specific evidence slides from a deck to answer natural language questions.
Mobile3M consists of approximately 1,000 image-based records captured from Android Cuttlefish Emulators for pre-training Mobile Vision Language Models (MobileVLM). Released by Xiaomi Corporation in late 2024, the data supports research into mobile-specific vision-language tasks and UI interaction.
CaptionEmporium provides 6.92 million captions for safe-for-work images from the e621/e926 platform, extending to January 2023. The dataset includes captions generated by a large language model (mistralai/Mistral-7B-v0.1) and a multimodal model (THUDM/CogVLM), with 8 LLM captions and 1 CogVLM caption per image. Most captions are described as substantially longer than 77 tokens.
Vlm3R Videos is a dataset hosted on HuggingFace by author Journey9ni. The dataset was last updated on December 18, 2025. Its specific content and scale are not detailed in the available metadata.
633,565 multimodal records of anime, manga, and game characters sourced from 3,860 Fandom wiki sites. The dataset pairs character images with metadata extracted from HTML and descriptive captions generated by the Qwen-VL-72B-Instruct vision-language model.
AGIEval is a human-centric benchmark for evaluating foundation models. This dataset contains the Gaokao Biology subtask, processed from the AGIEval repository. The data was authored by 'hails' and last updated on January 26, 2024.
A multimodal X-ray baggage security dataset introduced by Naoufel555 in 2025. It is described as the first of its kind, designed to address limitations in representing real-world, sophisticated threats and concealment tactics. The dataset aims to move beyond closed-set paradigms with predefined labels for computer-aided screening systems.
Over 37 million image-text associations extracted from Wikipedia articles. It is a multilingual dataset covering 108 languages, curated by Google Research and released by Wikimedia.
ReINTEL is a multimodal data challenge for identifying responsible information on social network sites. The dataset is associated with a competition hosted on AIHub, with top solutions invited to submit technical reports. It was created by ReliableAI and last updated in November 2024.
5,000 feature explanations generated for a 131k sparse autoencoder trained on the llava-next-llama3-8B vision-language model. The dataset includes two versions: a 'revised' set using an updated prompt and cached data, and a 'legacy' set using an older prompt and a subset of the LLaVA-NeXT-Data. It was created by lmms-lab and last updated on November 22, 2024.
svjack created this dataset to train a Pokémon text-to-image model. It pairs Pokémon images from the FastGAN project with captions generated by the BLIP model, adding a Chinese translation column. The dataset was last updated on Hugging Face in October 2022.
IFBench provides a benchmark for evaluating reward models designed to assess instruction-following capabilities in AI agents. The dataset was created by the THU-KEG research group and was published in March 2025 alongside their paper on agentic reward modeling. It contains samples with unique identifiers and source annotations for structured evaluation.
MMInstruction created a dataset for multimodal question answering, likely pairing images of scientific figures from arXiv papers with multiple-choice questions. The dataset was last updated on March 5, 2024. Each example includes an image path and a set of answer options, suggesting a focus on visual reasoning in academic contexts.
Annotated files for a benchmark assessing hallucination in large vision-language models applied to gastrointestinal image analysis. The dataset supports the paper 'Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision Language Models'. It was created by sandesh-pokhrel and last updated on September 5, 2025.