Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,561 datasets
The dataset comprises spatialized versions of the Libri-Trans and SLURP audio datasets, intended for enhancing translation and understanding tasks. It was authored by espnet and last updated in June 2022.
Cc2Dataset enables the extraction of multimodal pairs including image-text, audio-text, and video-text from the Common Crawl web archive. Developed by rom1504 and updated in 2023, it provides a pipeline to convert raw web documents into structured caption-media datasets. The tool is designed for big-data applications where media is paired with its surrounding document context.
VAST is an omni-modality dataset and foundation model from NeurIPS 2023 containing four distinct data categories: vision, audio, subtitles, and text. It provides a framework for multi-modal learning where visual frames are paired with corresponding sound, textual transcripts, and descriptive text.
22 million compositional questions and 113,000 images featuring scene graphs. Structured semantic representations for both images and questions support multi-step visual reasoning and logic-based evaluation.
A dataset built from open-ended questions paired with images, categorized by their requirement for vision, language, and commonsense reasoning. It tests multimodal understanding through tasks that cannot be solved from a single modality alone.
Doodles Captions Blip is a dataset hosted on HuggingFace by julianmoraes, last updated in October 2022. The platform tags indicate it contains both image and text modalities, suggesting it likely contains pairs of doodle-style images and descriptive captions. The dataset's specific size, structure, and content require verification after download.
A dataset titled 'Fcd Lmv2' was authored by 'sheikh' and last updated on July 7, 2022. The dataset is associated with the tag 'Regionus', but no further descriptive details, column information, or row counts are available.
Sft Hh Rlhf is a dataset published on Hugging Face by Dahoas, with its last update recorded on 2022-12-22. The title suggests it contains data related to reinforcement learning from human feedback (RLHF) and supervised fine-tuning (SFT). The dataset's specific content, scale, and structure require verification after download.
262,110 natural language captions describing 108,965 video segments from 6 popular TV shows. The dataset facilitates multimodal video captioning by providing visual frames alongside time-aligned subtitle dialogue.
22 million compositional questions and 113,000 images featuring dense scene graph annotations. The dataset structures visual reasoning through functional programs that map out the logic required to reach an answer for each image.
Osworld G provides a benchmark for computer-use grounding through UI decomposition and synthesis, released by xlang-ai as a NeurIPS 2025 Spotlight. It facilitates the training of Large Action Models (LAMs) by generating multimodal data that pairs visual GUI elements with natural language grounding instructions.
24,903 visual question-answering pairs based on images from the COCO dataset, categorized into multiple-choice and direct-answer formats. Each entry includes human-annotated rationales explaining the reasoning required to answer questions that need external knowledge beyond the visual content.
A multimodal dataset of open-ended questions about images. Answering them requires integrating vision, language, and commonsense knowledge.
Terra is a multimodal spatio-temporal benchmark for Earth science applications developed by CityMind-Lab and presented at NeurIPS 2024. It provides global-scale data across multiple modalities to support the development of advanced environmental and geographic models. The dataset was released in late 2024 to address the need for standardized benchmarks in the Earth science domain.
An image-text dataset likely containing pictures of nail sets. The dataset was published by Boyuan07 on HuggingFace and was last updated on March 29, 2023. The specific content, scale, and structure require verification after download.
DocVQA 1200 Examples is a multimodal dataset for visual question answering on documents. It contains 1,200 examples of images paired with text and questions, created by author nielsr and last updated in August 2022.
152,545 multiple-choice questions based on 21,793 video clips from 6 popular TV shows including The Big Bang Theory and Grey's Anatomy. The dataset provides paired subtitles and localized temporal annotations for every question to support multimodal reasoning.
Multilingual image captions with expert-created annotations and language labels. It supports numerous languages, including those tagged with the language codes adq, aeu, and abc. The number of rows, the columns, and the file formats are not specified.
TVQA+ provides spatio-temporal grounding labels for video question answering tasks. Published at ACL 2020, the dataset facilitates multimodal reasoning by linking natural language questions to specific video frames and regions.
This collection, curated by ryanrudes, comprises over 40 million images sourced from Wikimedia Commons. Updated in October 2023, the repository provides large-scale visual data for deep learning and computer vision research.