Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
INS-MMBench is the first comprehensive benchmark for evaluating Large Vision-Language Models in the insurance domain. It covers four insurance types (auto, property, health, and agricultural) and key insurance stages. The dataset was created by FDU-INS and was last updated on Hugging Face in July 2025.
A dataset for vision reasoning instruction tuning, released in 2025. The data is authored by Di Zhang and was last updated on March 6, 2025. It appears to be derived from the LLaVA-CoT-100k dataset, with images and raw data hosted on separate Hugging Face repositories.
OBELICS is a massive, curated collection of 141 million English web documents containing 115 billion text tokens and 353 million images. The documents feature interleaved text paragraphs and images, extracted from Common Crawl dumps. It was created by HuggingFaceM4 and released in August 2023.
An open-source dataset and builder for prototyping olfaction-vision-language tasks in AI, robotics, and AR/VR domains. It is designed for applications like vision-scent navigation for drones or augmenting VR experiences with scent. Row count, column count, and file formats are not specified.
JourneyBench Multi_Image_VQA is a test-only dataset for debugging multimodal reasoning models. It contains visual question answering examples requiring analysis across multiple images. The dataset was created by hiyouga and last updated in April 2025.
Over 2 million images and videos form the core of the InstQA dataset, which also contains 6 million instance captions, 2 million image/video captions, and 10 million instance-level visual question answers. This dataset was created by wovenbytoyota-vai and was last updated on October 15, 2025. It is designed for instance-aware spatio-temporal visual question answering tasks.
KREAM Product Blip Captions is a dataset for finetuning text-to-image generative models. It consists of image-text pairs collected from KREAM, a major online resale market in Korea. The dataset was created by hahminlew and was last updated on December 7, 2023.
A benchmark suite for evaluating cell phenotyping capabilities of pathology Foundation Models, created by Kainmueller-Lab and last updated on 2025-10-09. The collection includes four key datasets processed into the LMDB format to facilitate large-scale experimentation. The datasets are hosted on the Hugging Face platform.
LLaVA-Plus-v1-117K is a set of 117,000 GPT-generated multimodal tool-augmented instruction-following data points. It was collected in September 2023 by prompting the ChatGPT/GPT-4-0314 API to build large multimodal agents with vision and language capabilities. The dataset was created by the LLaVA-VL organization.
BigDocs-Bench is a benchmark suite introduced by ServiceNow for evaluating multimodal models on tasks that transform visual inputs into structured outputs. The dataset is associated with the paper 'BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks'. The benchmark data was initially released on 2024-12-10 and last updated on the platform on 2025-03-19.
Chatbot Arena Conversations JA (calm2) is a Japanese instruction dataset constructed for RLHF, as described in its associated paper. The dataset was created to test whether English datasets can be adapted for Japanese using only open-source tools and models. Prompts are Japanese translations of user inputs from the lmsys/chatbot_arena_conversations dataset, which are human-written and licensed under CC-BY 4.0.
MM-RLHF is a project for aligning Multimodal Large Language Models with human preferences. The release includes a high-quality alignment dataset and a strong critique-based reward model. The project was open-sourced by yifanzhang114 in February 2025.
VLFeedback contains 80,000 multi-modal instructions and 320,000 model responses annotated by GPT-4V for vision-language preference learning. Developed by MMInstruction in late 2023, the dataset aggregates instructions from diverse sources to evaluate a pool of 12 different Large Vision-Language Models (LVLMs).
SLAKE is a Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering, presented at ISBI 2021. This version, uploaded by mdwiratathya, filters the original bilingual dataset to contain only English entries, providing images as PIL objects, questions, and answers. The dataset was last updated on the Hugging Face platform on June 14, 2024.
Released by NVIDIA in May 2025, this multimodal dataset contains pairs of videos and text annotations for embodied reasoning tasks. It includes data from BridgeDatav2, RoboVQA, Agibot, HoloAssist, AV, and RoboFail datasets. The annotations are structured for Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and benchmarking purposes.
MINT-1T contains 1 trillion text tokens and 3.4 billion images, a tenfold scale increase from prior open-source multimodal collections. Created by a University of Washington team, this dataset interleaves text and images from sources including ArXiv papers and PDFs to support multimodal pretraining research.
Text-only reasoning pairs and logic-based question-answer sets synthesized from game code across multiple game environments. This data utilizes game mechanics to facilitate training and evaluation of general reasoning in models via the Code2Logic framework.
700,000 Vietnamese vision-language samples were generated using Gemini Pro and prompt engineering techniques like few-shot learning and caption-based prompting. The dataset was created by Vi-VLM and was last updated in June 2024.
A subset of 2.66 million Chinese image-text pairs extracted from the LAION-5B-high-resolution multilingual multimodal dataset. The dataset was created by 'wanng' and last updated on Hugging Face in December 2022. The provided metadata file is approximately 381 MB and contains text information such as URLs, but does not include the actual image files.
Between 10,000 and 100,000 expert-annotated sentences comprise this dataset for token-level acronym identification in the scientific domain. Created by Amirveyseh for the AAAI-21 Workshop on Scientific Document Understanding, it includes standardized training, validation, and test splits.