Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,560 datasets
Sam Llava Captions10M is a dataset published on Hugging Face by PixArt-alpha on January 12, 2024. The title suggests it contains image-caption pairs, likely for vision-language model training. The dataset's scale and specific content require verification after download.
VQAv2_train is a dataset for visual question answering tasks, likely containing pairs of images and questions with corresponding answers. The dataset was uploaded by Multimodal-Fatima to Hugging Face and last updated in April 2023.
A multimodal dataset published on Hugging Face by MAINLAND on July 22, 2024. The title suggests it likely contains satellite imagery paired with textual instructions, intended for training vision-language models. The specific content, scale, and geographic scope require verification after download.
25,000,000 image-caption pairs structured for large-scale multimodal model training. The collection expands upon the 4M Img Caps framework to provide a higher volume of text-image associations for vision-language tasks.
Vision-language document retrieval training pairs transformed from the vidore/colpali_train_set for Tevatron compatibility. The data is structured to support the training of multi-vector retrieval models like ColPali within the Tevatron ecosystem.
Small Clean Llava Instruct Mix is a dataset published on Hugging Face by damerajee on May 27, 2024. The title suggests it is a curated collection of instruction-following examples, likely for training or fine-tuning vision-language models. Its specific content, size, and structure require verification after download.
RoboVQA contains video and text data for training models to answer questions about robotic scenes. The dataset includes over 100,000 entries, as indicated by its Hugging Face size category. It was created by Tianli and last updated in July 2025.
Over 100,000 entries combine images with question-answer pairs for visual question answering tasks. The dataset was created by lmms-lab and last updated in January 2024.
A dataset for visual question answering on documents, published by HuggingFaceM4 on December 18, 2023. The dataset likely contains images of documents paired with questions and answers. Its specific scale, columns, and content require verification after download.
Multimodalpv is a dataset published on Hugging Face by wealan123123 and last updated on July 5, 2025. The available metadata does not specify its content, size, or structure.
Persian VQA is a dataset for Visual Question Answering tasks in the Persian language. It was published by AUT-NLP on the Hugging Face platform and was last updated on December 12, 2024. The dataset's specific content, scale, and structure are not detailed in the available metadata.
Multimodal product data from the Rakuten France e-commerce platform, contributed by user yassinemtg and updated in February 2025. The dataset is designed for classification tasks, combining visual and textual information. It focuses on products sold in the French market.
The VQA dataset contains open-ended questions about images, requiring an understanding of vision, language, and commonsense knowledge to answer. It was created by HuggingFaceM4 and last updated in June 2022.
S3E provides multimodal sensor data for multi-robot collaborative Simultaneous Localization and Mapping (SLAM), developed by DapengFeng and published in IEEE Robotics and Automation Letters (RA-L). The dataset facilitates research into multi-agent systems by providing synchronized data streams from multiple robotic platforms.
Mimic Cxr Vqa likely contains chest X-ray images paired with questions and answers for visual question answering tasks. It was published on Hugging Face by MiniMedMind on November 3, 2024; its exact size and content are unspecified.
TVR provides video-subtitle pairs and natural language queries for temporal moment retrieval, introduced by Jie Lei at ECCV 2020. The collection focuses on the TV show domain, requiring models to utilize both visual and textual dialogue features to locate specific events.
A dataset published on Hugging Face by henry-07 on April 6, 2025. It likely contains pairs of satellite imagery from the Sentinel program and corresponding textual captions. The specific volume, format, and column structure are unknown.
A multimodal dataset likely containing landscape imagery paired with compositional descriptions or labels. The dataset was authored by TomEijkelenkamp and published on Hugging Face on May 22, 2024. Specific details regarding content, size, and structure are not provided.
3,700 question-answer pairs linked to images and a knowledge base of 1.5 million Wikipedia entities. The dataset facilitates visual entity retrieval where answers are specific entities rather than generic object labels.