Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
APTO-001 developed this dataset to improve large language model instruction-following capabilities. Based on the platform description, the dataset likely contains synthetic text examples designed to train models on handling complex, multi-step instructions. It was last updated on September 12, 2025.
MME-RealWorld is a multimodal benchmark dataset for evaluating large language models, launched on August 20, 2024. The dataset includes a lite version with 50 samples per task for inference acceleration, as noted in the November 14, 2024 update. It is authored by yifanzhang114 and is supported by evaluation frameworks like VLMEvalKit and Lmms-eval.
MJ Showcase 2024 is a dataset of top-voted AI art creations manually collected daily between May and August 2024. The dataset includes 8,551 rows and provides both images and their associated text prompts. It was created by author shb777 and last updated on Hugging Face in August 2025.
Approximately 6,000 human responses from around 2,000 annotators were collected to evaluate the Seedance 1 Pro video generation model on a benchmark. The data was gathered in roughly 5 minutes using the Rapidata Python API. The dataset was published by Rapidata and last updated on August 11, 2025.
DEJIMA is a large-scale Japanese multimodal dataset containing 3.88 million image-caption pairs and 3.88 million image-question-answer pairs. It was created by MIL-UT using a reproducible pipeline involving web-scale image collection, strict filtering, evidence extraction, and LLM-based annotation under grounding constraints. The dataset was last updated on December 2, 2025.
SimLingo-Data consists of 3,308,315 samples of autonomous driving data generated in the CARLA 2.0 simulator by RenzKa. It integrates sensor readings and action labels with natural language annotations for driving commentary, instruction following, and visual question answering, collected using the PDM-Lite rule-based expert.
CSVQA is a Chinese multimodal benchmark designed to evaluate the STEM reasoning capabilities of Vision-Language Models. The dataset was created by Skywork and its associated paper was released on arXiv in June 2025. It focuses on scientific visual question answering, combining images with text in Chinese.
A subset of 89,440 videos with 608,000 event instances, annotated for temporal grounding. The dataset was created by yingsen and last updated on August 1, 2025. It is derived from the InternVid-FLT video-text alignment data through an automated annotation process detailed in the associated paper.
Cambrian Vision-Centric Benchmark (CV-Bench) is a dataset introduced in the Cambrian-1 research paper for evaluating vision-centric multimodal large language models. The dataset contains annotations and images pre-loaded for processing with Hugging Face Datasets. It was created by nyu-visionx and last updated on July 20, 2025.
Over 9.3 million synthetically generated image-text pairs form this multimodal dataset created for training the SmolDocling model. The dataset covers code snippets from 56 different programming languages, with text sourced from permissively licensed sources and images generated at 120 DPI using LaTeX and Pygments. It was created by the docling-project and last updated on July 16, 2025.
Over 700 anonymized images, primarily captured from vehicles, form this multimodal benchmark. Each image is paired with a question and a verifiable answer, designed to test real-world scene understanding. The dataset was released by xAI in April 2024.
29 pages of documentation for Anthropic's Claude Code, crawled on 2025-06-24. The dataset contains 27,764 words formatted into 29 chunks. It was prepared by author 'ratanon' for use in LLM training and RAG systems.
Over 4000 annotated samples capture the welding process through video, audio, sensor time-series, and post-weld images. IntelLabs collected this data in an automotive production floor setting with an industry supplier. The dataset was published in September 2025 to support multimodal defect detection research.
From 1989 onwards, this dataset contains multimodal analyses of front pages from two Spanish newspapers, El País and ABC, covering the fall of the Berlin Wall. It was created by Silvia Molina Plaza and focuses on layout structure and rhetorical argumentation. The dataset was last updated on October 14, 2025.
A high-confidence subset of VisualWebInstruct curated by TIGER-Lab, last updated October 24, 2025. It contains verified multimodal question-answer pairs where correctness, reasoning quality, and image-text alignment have been explicitly validated. The dataset is designed for Reinforcement Learning and Reward Model training pipelines.
550 annotated speech samples categorized across 11 distinct paralinguistic dimensions for speech-to-speech model evaluation. The dataset includes curated audio files and corresponding annotations derived from the Step-Audio 2 technical research.
4,000 multimodal instruction-tuning samples designed to instill Evidence-of-Thought (EoT) reasoning into Vision-Language Models for remote sensing. The dataset utilizes a Socratic questioning approach to guide models through logical, step-by-step interpretation of satellite and aerial imagery.
VectorInstitute released VLDBench in January 2026 as a large-scale benchmark for evaluating Vision-Language Models (VLMs) and Large Language Models (LLMs) on multimodal disinformation detection. The framework provides a testing ground for AI safety by presenting models with deceptive content that integrates both visual and textual modalities.
159,549 new question-answer pairs form the Kvasir-VQA-x1 dataset, a large-scale benchmark for medical visual question answering in gastrointestinal endoscopy. SimulaMet created this multimodal dataset to advance robust MedVQA systems. The dataset was featured in the MediaEval Medico 2025 Challenge and was last updated on Hugging Face in August 2025.
ALLaVA-4V is a multimodal dataset created by FreedomIntelligence using GPT-4V to generate detailed captions and complex reasoning question-answer pairs for images. The dataset incorporates data from sources like LAION and WizardLM, with its generation pipeline and prompts documented on the project page. It was last updated on June 8, 2025.