Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
150,000 GPT-generated multimodal instruction-following data points collected in April 2023. The dataset utilizes the GPT-4-0314 API to synthesize vision-language interactions for the development of large multimodal models.
A collection of 21,930,344 synthetic English captions for 10,965,172 images from the conceptual_12m dataset. The captions were generated using the llama3-llava-next-8b model, followed by cleanup and shortening with Meta-Llama-3-8B. The dataset was created by CaptionEmporium and last updated on Hugging Face in June 2024.
MINT-1T is an open-source multimodal interleaved dataset containing one trillion text tokens and 3.4 billion images, representing a 10x scale-up from prior open-source collections. It was created by a team from the University of Washington to facilitate research in multimodal pretraining. The dataset was last updated on the platform in September 2024.
MINT-1T is an open-source multimodal interleaved dataset containing 1 trillion text tokens and 3.4 billion images, a 10x scale-up from prior open-source collections. It includes previously untapped sources such as PDFs and ArXiv papers and is designed for multimodal pretraining research. The dataset was created by a team from the University of Washington and was last updated on the platform in September 2024.
CommonCatalog CC-BY provides approximately 100 million high-resolution images paired with synthetic captions, released by common-canvas in 2024. The collection originates from Yahoo Flickr data from 2014 and features images with resolutions up to 4K.
ChartMimic evaluates visually-grounded code generation in large multimodal models using information-intensive visual charts. The dataset was created by the ChartMimic team and was last updated in June 2025.
A multimodal mathematics dataset collected from real middle school exams in China, featuring open-ended problems. It was created by THU-KEG and last updated on June 30, 2024. The dataset is annotated with fine-grained three-dimensional labels for difficulty, grade, and knowledge points.
FRED is a large-scale multimodal dataset designed for drone detection, tracking, and trajectory forecasting. The dataset, authored by GabrieleMagrini, provides spatiotemporally synchronized RGB and event data. It was last updated on Hugging Face on October 3, 2025.
OmniCorpus-CC is a unified multimodal corpus of images interleaved with text, at the 10-billion-image scale. It contains 988 million image-text interleaved documents collected from Common Crawl. The dataset was created by OpenGVLab and was last updated on the platform in March 2025.
EEE-Bench is a multimodal benchmark comprising 2,860 problems across 10 electrical and electronics engineering subdomains, including analog circuits and control systems. It was created by afdsafas and last updated on June 23, 2025. The benchmark is designed to evaluate the practical engineering capabilities of large multimodal models using complex visual inputs.
ToolVQA is a dataset introduced at ICCV 2025 for evaluating real-world tool-use capabilities in Large Foundation Models. The dataset is hosted on Hugging Face by author DietCoke4671 and was last updated on August 16, 2025. It is designed to address gaps in existing benchmarks for tool-augmented Visual Question Answering.
This dataset supports the 2026 SoccerNet Challenge for multimodal (text, image, video) multiple-choice question answering. It covers 14 distinct soccer understanding tasks, including assessing player and team background knowledge, determining camera status, classifying actions, and recognizing fouls. The dataset was created by SoccerNet and last updated in October 2025.
Open-LLaVA-NeXT 1M is a 1 million sample dataset for supervised fine-tuning, created to reproduce the LLaVA-NeXT model series. The author augmented the sharegpt4v_mix665k dataset and attempted to align with LLaVA-NeXT's training data, substituting inaccessible user interaction data with 200K samples from ALLaVA-Instruct-VFLAN-4V. This dataset was uploaded to Hugging Face by Lin-Chen on October 25, 2024.
IceKhoffi's Chicken Health and Behavior Multimodal Dataset contains visual and audio data collected from chicken farms. It is designed for developing early detection systems for health issues and anomalous poultry behavior. The dataset was last updated on the Hugging Face platform in August 2025.
Approximately 1.6 million instruction-following examples for chest X-ray report generation, organized across 5 subsets. The dataset is designed for visual instruction tuning and building large multimodal models capable of generating structured radiology reports. It was authored by 'erjui' and last updated on Hugging Face on 2025-10-28.
MINT-1T is an open-source multimodal interleaved dataset containing 1 trillion text tokens and 3.4 billion images, representing a 10x scale-up from previous open-source collections. It was created by a team from the University of Washington and includes sources such as PDFs and ArXiv papers to facilitate multimodal pretraining research. The dataset was last updated on the platform in September 2024.
A training and evaluation corpus for VDocRAG, a retrieval-augmented generation framework designed to understand real-world documents from visual features. The dataset is a unified collection of open-domain document visual question answering data, encompassing diverse document types and formats. It was created by NTT-hil-insight and last updated on 2025-05-26.
A curated collection of high-quality synthetic Python unit tests derived from two code instruction tuning datasets: CodeFeedback-Filtered-Instruction and the training set of TACO. The dataset was created by author KAKA22 and last updated on 2025-01-20. It was used to train CodeRM-8B, a unit test generation model.
FactualVQA (FVQA) is a multimodal Visual Question Answering dataset created for search-augmented training and evaluation. It was authored by lmms-lab and last updated on 2025-08-09. The dataset emphasizes knowledge-intensive questions that require external information beyond the given image.
GameQA-5K is a dataset of 5,000 training samples extracted from the larger GameQA-140K dataset. It was created by the OpenMOSS-Team and published on Hugging Face in June 2025 for use in training models via the GRPO method. The data is synthesized from game code to enhance multimodal reasoning in vision-language models.