Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
A curated subset of 6,000 paintings from the WikiArt collection, created by Lizagrin and last updated in October 2025. It was developed for multimodal art retrieval, combining visual, textual, and semantic information. Each artwork record includes an image row index and a caption generated automatically with the BLIP model.
This dataset comprises a portion of the ThinkMorph-7B training corpus across four visual reasoning categories: Jigsaw Assembly, Spatial Navigation, Visual Search, and Chart Refocus. It utilizes an interleaved format to support cross-modal interactions and varying levels of visual engagement.
47,400 curated problems spanning mathematics and programming domains specifically formatted for reinforcement learning. The collection includes 39,000 math problems from sources like AoPS and DeepMath-103K, alongside approximately 8,400 coding challenges.
This dataset accompanies MathCoder-VL, a series of open-source large multimodal models tailored for general math problem-solving. It likely contains multimodal math problems that combine visual and textual elements. It was created by MathLLMs and last updated on October 11, 2025.
CharXiv is a diverse and challenging benchmark for chart understanding, fully curated by human experts. It includes 2,323 high-resolution charts manually sourced from arXiv preprints. The dataset was created by princeton-nlp and released in 2024.
MathCanvas-Edit contains 5.2 million step-by-step editing trajectories for mathematical images. The dataset was created by shiwk24 and was last updated on the Hugging Face platform in November 2025. It forms a core component of the MathCanvas framework for training large multimodal models.
The dataset integrates images from the Infographic_vqa and AFTdb (Arxiv Figure Table Database) collections. It consists of image-text pairs, with each image linked to an average of five questions and answers available in both English and French. The dataset was created by cmarkea and last updated on Hugging Face in August 2024.
FUSION-10M is a large-scale dataset of image-caption pairs designed for pretraining multimodal AI models. It builds upon established datasets like LLaVA, ShareGPT4, and PixelProse and includes 2 million synthesized task-specific pairs. The dataset was created by starriver030515 and was last updated in April 2025.
Examples of harmful and harmless language. It aggregates samples from seven source datasets, including Anthropic/hh-rlhf and allenai/real-toxicity-prompts. The data is available in both Portuguese and English.
The BLIP3-KALE dataset comprises 218 million image-text pairs with knowledge-augmented dense captions. It was created by Salesforce and last updated on February 3, 2025. The dataset is designed to combine web-scale knowledge with detailed image descriptions.
ChiMed-VL-Alignment is a multimodal dataset containing 580,014 Chinese medical image-text pairs. The dataset was created by williamliu and was last updated on the Hugging Face platform in December 2023. Pairs are categorized into context information and image-specific descriptions, with the context category containing 167 million tokens and the description category containing 63 million tokens.
A subset of 595,000 image-text pairs from the CC-3M dataset, filtered for balanced concept coverage. It was created by liuhaotian for the pretraining stage of visual instruction tuning, aiming to build large multimodal models. The dataset was last updated on July 6, 2023.
The dataset integrates table images from the AFTdb (Arxiv Figure Table Database) curated by cmarkea. Each image is paired with LaTeX source code and linked to an average of ten questions and answers, half in English and half in French. Questions and answers were generated using Gemini 1.5 Pro and Claude 3.5 Sonnet, and the dataset was last updated on 2024-09-26.
A dataset for mixed-modal instruction tuning created by researchers at the University of California, Los Angeles. It is designed for training biomedical assistants by integrating multimodal information. The dataset page was last updated on 2025-07-19.
6,000 multimodal question-answer pairs presented in the EchoInk-R1 research paper. The dataset was created by harryhsing and last updated on the Hugging Face platform in May 2025. It is designed for exploring audio-visual reasoning in multimodal large language models via reinforcement learning.
3,167 completed human-computer interaction tasks featuring video, screenshots, and DOM snapshots, released by Paradigm Shift AI in 2025. This multimodal collection captures granular interaction events to support the development of AI agents capable of navigating graphical user interfaces.
A multimodal spiritual dataset featuring Ikaros in Spanish, Jiv Jago in Hindi, and languages of Russia, CIS, and Ukraine. The dataset was created by nativemind and was last updated on October 24, 2025. It includes enhanced multimodal data and supports up to 50 language examples.
A multimodal dataset for Point of Interest recommendation based on the Yelp Open Dataset. It includes business metadata, user reviews, business photos, and LLM-generated summaries of reviews and images. The dataset was uploaded by wzehui on September 8, 2025.
Over 8,700 labels and descriptions were generated for 1,252 Vietnamese handwriting images using the Gemini 1.5 Flash model. The dataset was created by 5CD-AI from the train splits of the Cinnamon AI Challenge and UIT-HWDB datasets. It was last updated on Hugging Face in August 2024.
GuanacoDataset is a multimodal visual question answering dataset intended for aligning vision-language models with large language models. The dataset's creator is JosephusCheung, and it was last updated in April 2024, though its current availability on Hugging Face is uncertain.