Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,560 datasets
An evaluation benchmark for hallucination in Large Multimodal Models, consisting of 96 challenging questions based on images from OpenImages. It includes ground-truth answers and image contents. The dataset was created by Shengcao1006 and uploaded in November 2023.
Llava Critic Grpo Dataset is a collection of data for evaluating and critiquing multimodal AI models. Published by the organization lmms-lab on the Hugging Face platform, it was last updated on June 24, 2025. The dataset's specific content and structure are not detailed in the available metadata.
COREVQA is a multimodal benchmark dataset for visual question answering tasks. It combines images with corresponding textual questions and answers, designed for evaluating AI models. The dataset originates from the UCI platform and is associated with computer vision and natural language processing research.
The 'Omni Med Vqa Mini' dataset was published on the Hugging Face platform by author 'simwit' and last updated on April 24, 2025 at 17:24:49. Its title suggests it contains medical images paired with questions and answers. The specific content, size, and structure require verification after download.
A webdataset likely containing 3 million examples for training multimodal AI models, as indicated by the title. It was published by author mvp-lab on the Hugging Face platform and last updated on September 20, 2025. The dataset appears to be associated with the LLaVA (Large Language and Vision Assistant) project, suggesting it contains paired image-text data.
12,000,000 English image-caption pairs derived from Google's Conceptual 12M dataset. The collection is structured in a TSV format containing image URLs, local filenames, and descriptive captions for each entry.
Iconclass is a classification system for art and iconography. The dataset likely contains structured codes and descriptions for visual symbols and themes. It was published on Hugging Face by davanstrien and last updated on September 10, 2025.
Psychocounsel Preference is a text dataset for preference learning in psycho-counseling contexts, created by the Psychotherapy-LLM author group. It is designed to unlock large language models' counseling skills, as described in the associated research paper. The dataset was last updated in March 2025.
A multimodal dataset for vision-language model training, hosted on Hugging Face by author Journey9ni. The dataset was last updated in June 2025 and is categorized as containing up to 100,000 entries. It is designed for tasks involving the 3R framework.
Vqa Multitask is a dataset for multitask learning, likely combining visual and textual data for question answering. It was published on Hugging Face by author WaltonFuture and was last updated on July 9, 2025. The specific content, scale, and structure require verification after download.
Therapeutics Data Commons (TDC) is a collection of multimodal benchmarks and datasets for drug discovery and therapeutic science developed by the Harvard MIMS group. Updated as recently as July 2025, it provides a standardized framework for evaluating machine learning models across the drug development pipeline.
SurveillanceVQA 589K is a dataset for visual question answering tasks, likely containing image-question-answer pairs. The dataset was created by author fei213 and was last updated on Hugging Face on May 16, 2025 at 03:52:51. Its specific content, such as the source and nature of the surveillance imagery, requires verification after download.
A subset of the VQAv2 dataset, which is a benchmark for visual question answering tasks. The dataset was published on the Hugging Face platform by user 'merve' and was last updated on August 8, 2024. The specific scale, content, and structure of this 'Small' version require verification after download.
Aggregating multiple benchmarks for table understanding, this repository by esborisova was updated in September 2025. It categorizes resources into tasks such as table structure recognition, table-to-text, and table question answering.
A benchmark for synthetic data detection created by bczhou and released on November 5, 2024. The data supports the paper "LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models". It is hosted on the Hugging Face platform.
FlowVQA RAG is a dataset uploaded to Hugging Face by user 'kkyzl' on October 9, 2025. The dataset's title suggests it is designed for Visual Question Answering (VQA) tasks using a Retrieval-Augmented Generation (RAG) framework. Its specific content, scale, and structure require verification after download.
Mpdocvqa Corpus is a multimodal dataset published on Hugging Face by author AHS-uni. The dataset was last updated on June 8, 2025. Its specific content and scale are not detailed in the provided metadata.
Human preference data collected from the r/WritingPrompts subreddit. The dataset was created by author euclaise and was last updated on December 25, 2023. The specific size, format, and column structure are not detailed in the provided metadata.
Indian Cartoon Blip is a dataset uploaded by Surbhipatil to the Hugging Face platform. The dataset was last updated on September 2, 2025 at 10:39:41. Its specific content, size, and structure are not detailed in the available metadata.
A-OKVQA is a dataset for visual question answering that requires external knowledge and reasoning. The dataset was created by HuggingFaceM4 and was last updated in February 2024.