Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
12.8 million image URLs and their corresponding CLIP embeddings derived from the datacomp_small benchmark. The dataset is processed via the Fondant framework to provide a production-ready format for multimodal machine learning tasks without requiring raw image storage.
Dasool's visual question answering dataset focuses on butterflies and moths. It is designed to benchmark Vision-Language Models for tasks like fine-grained species identification and ecological reasoning. The dataset was last updated on 2025-02-18.
A dataset of Python LeetCode problems intended for training and evaluating large language models for code. It was created by author 'newfacade' and last updated on Hugging Face on 2025-05-29. The dataset's specific size and structure are not detailed in the provided metadata.
FunBench is a novel visual question answering benchmark designed to evaluate multimodal large language models' fundus reading skills. The dataset was created by AIMClab-RUC and last updated on May 14, 2025. Code and a description are available on a linked GitHub repository.
A Thai translation of the LLaVA-CC3M-Pretrain-595K dataset, originally created by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. This dataset is intended for pre-training large multimodal models with Thai language capabilities and was uploaded to Hugging Face by user 'worapob' on March 9, 2025.
0.5 million synthetic Chinese document images generated by the SynthDoG tool for training the Donut model. It is part of a multi-language collection created by naver-clova-ix and was last updated in January 2024.
4,000,000 image-caption pairs stored in PyArrow IPC format for high-performance multimodal training. The dataset utilizes memory-mapped files to enable low-latency data access during large-scale model optimization.
Zaynoid published this dataset on Hugging Face on December 17, 2025. The title suggests a collection of 1,000 items for training vision-language models. The specific content and structure require verification after download.
Flickr30K image caption quintets, used to compute denotational similarities for semantic inference tasks. It was created by embedding-data and last updated in August 2022.
2023 images were pulled from Pexels, primarily depicting people holding objects. The dataset includes full images paired with captions generated by the CogVLM model. It was created by author 'lodestones' and last updated on the platform in June 2024.
A subset of approximately 15 million image-text pairs from the YFCC100M dataset, curated for training vision-language models. It was prepared by author vishaal27 and uploaded to Hugging Face in January 2024. The dataset provides page URLs and direct image download URLs for each entry.
The ActivityNet Captions dataset contains 20,000 videos, each annotated with an average of 3.65 temporally localized descriptive sentences, resulting in 100,000 total sentences. Each sentence describes a unique video segment and has an average length of 13.48 words. The dataset was created by Leyo.
Facebook AI created this benchmark dataset to measure progress in multimodal reasoning for hate speech detection. The dataset pairs images with text to form memes, each labeled for hateful content. It was published in 2020 and last updated on the platform in December 2022.
The dataset connects 20,000 videos to temporally annotated sentence descriptions. On average, each video contains 3.65 temporally localized sentences describing unique segments and multiple events.
IndicMMVet is a dataset for evaluating Large Vision-Language Models on Visual Question Answering tasks, created by krutrim-ai-labs. It focuses on integrated capabilities and multilingual content, specifically for Indian contexts. The dataset was last updated on March 5, 2025.
28,408 images from Open Images paired with 142,040 captions that require models to read and reason about text within the visual scene. This version is specifically formatted for the lmms-eval pipeline to facilitate standardized benchmarking of large multi-modality models.
Longitudinal data covering infants from 9 to 12 months of age, used to analyze the role of touch in infant language development. The dataset includes databases for analysis and video clips with examples of categorized behaviors. It was authored by Eva Murillo Sanz and last updated on October 14, 2025.
Between 100,000 and 1,000,000 Midjourney V6 images re-captioned using the LLaVA-1.6 vision-language model. Released by brivangl in October 2024, the data serves as an augmented version of the CortexLM/midjourney-v6 repository for multimodal research.
300,000 examples of visual instruction data for training multimodal large language models. The dataset combines 150,000 English examples from the LLaVA project and 150,000 from the openbmb project. Author BUAADreamer uploaded this collection to Hugging Face on September 2, 2024.
OpenViVQA provides over 11,000 images paired with more than 37,000 open-ended question-answer pairs in Vietnamese. The dataset was created by uitnlp for the VLSP 2023 - ViVRC shared task challenge and was last updated in December 2023.