Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
Arc2Face contains approximately 21 million facial images representing 1 million unique identities at a resolution of 448x448 pixels. The dataset was created by upsampling half of the WebFace42M database using a blind face restoration network for the Arc2Face foundation model research.
100 million Chinese image-text pairs form a subset of the Noah-Wukong multimodal dataset. The dataset was uploaded by author 'wanng' to Hugging Face and last updated on December 11, 2022. The text metadata for these pairs occupies approximately 16GB of space.
Visual Haystacks (VHs) is a benchmark dataset designed to evaluate Large Multimodal Models' capability to handle long-context visual information. It is described as the first vision-centric Needle-In-A-Haystack benchmark. The dataset was created by tsunghanwu and was last updated on Hugging Face on October 16, 2024.
Between 100,000 and 1,000,000 multimodal conversational records comprise this dataset released by trl-lib in 2025. It facilitates instruction tuning by pairing images with multi-turn dialogue prompts and target completions. The data is structured specifically for language modeling and visual-text alignment tasks.
Released in 2024, TemporalBench is a video understanding benchmark designed to evaluate fine-grained temporal reasoning for multimodal video models. It consists of approximately 10,000 video question-answer pairs sourced from around 2,000 high-quality human-annotated video captions. The dataset was created by Microsoft.
665,000 multimodal instruction-following pairs consisting of images and text sequences, compiled by kaiyuyue and updated in 2025. This collection consolidates the LLaVA-1.5-665K mixture into a single repository, providing raw images in WebDataset format alongside instruction JSONs.
A multimodal dataset for low-resource language translation, as described in the paper 'From Text to Multi-Modal: Advancing Low-Resource-Language Translation through Synthetic Data Generation and Cross-Modal Alignments'. The paper was accepted by the LoResMT 2025 workshop at NAACL 2025. The dataset was uploaded by author 'qianstats' to Hugging Face on August 13, 2025.
3,763 web-collected videos with subtitles and multiple-choice questions comprise this long-context multimodal benchmark. Created for NeurIPS 2024, it evaluates large multimodal models on video-language interleaved inputs with durations reaching up to one hour.
This dataset accompanies MathCoder-VL, a series of open-source large multimodal models tailored for general math problem-solving. It likely contains 8.6 million multimodal examples pairing images with code, supporting the development of models like FigCodifier-8B. It was created by MathLLMs and updated on October 11, 2025.
MSR-VTT contains 10,000 video clips paired with 200,000 descriptive captions. The dataset, originally created by Microsoft Research, is a standard benchmark for text-video retrieval and captioning tasks. It was last updated on the platform in August 2025.
MME-CoT is a benchmark dataset for evaluating Chain-of-Thought reasoning in Large Multimodal Models. It was created by author CaraJ and published on Hugging Face, with its last update recorded on 2025-03-19. The dataset focuses on assessing reasoning quality, robustness, and efficiency.
This robotics dataset contains 3,000 episodes and 149,985 frames of multimodal data collected from a Kuka robot arm. Released by the LeRobot team and associated with arXiv paper 1810.10191, the collection provides 20 FPS video and time-series sensor data for a single robotic task.
FragFake is a dataset for edited-image detection using Vision-Language Models (VLMs). It contains four groups of examples (Gemini-IG, GoT, MagicBrush, and UltraEdit), each with two difficulty levels: easy and hard. The dataset was created by Vincent-HKUSTGZ and was last updated on July 31, 2025.
2,000 rows of preference data for Direct Preference Optimization (DPO) fine-tuning, structured with prompt, chosen, and rejected fields. The prompts and chosen responses are sourced from the iamtarun/python_code_instructions_18k_alpaca dataset, while the rejected responses are generated by a base Llama 3.1 model. The dataset was uploaded by quangduc1112001 to Hugging Face and last updated on November 4, 2024.
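As a rough illustration of the prompt/chosen/rejected layout described above, the sketch below shows what one preference record might look like and how it could be sanity-checked before DPO fine-tuning. The field names come from the description; the example text and the helper `is_valid_preference_record` are hypothetical and are not taken from the dataset itself.

```python
# Field names as given in the dataset description.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def is_valid_preference_record(record):
    """Return True if all three fields are non-empty strings
    and the chosen response differs from the rejected one."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return False
    return record["chosen"] != record["rejected"]


# Hypothetical record, invented for illustration only.
example_record = {
    "prompt": "Write a Python function that returns the square of a number.",
    "chosen": "def square(x):\n    return x * x",
    "rejected": "def square(x):\n    return x + x",
}

print(is_valid_preference_record(example_record))
```

A check like this only confirms the record shape; it says nothing about whether the chosen response is actually better than the rejected one.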
Relation252K contains source-target image pairs across 218 distinct image editing tasks, released by handsomeWilliam in 2025. It serves as the evaluation set for the RelationAdapter model, focusing on the transfer of visual relations within Diffusion Transformers.
Harmonized Landsat and Sentinel-2 multispectral reflectance imagery and MERRA-2 observations centered around eddy covariance flux towers. The dataset includes corresponding Gross Primary Productivity data and is intended to fine-tune geospatial foundation models for GPP regression. It was created by ibm-nasa-geospatial and last updated on October 25, 2024.
Robo2VLM 1 provides between 100,000 and 1,000,000 visual question-answering records derived from real-world robot manipulation trajectories. Created by researcher keplerccc and updated in late 2025, the dataset uses multi-modal robot data to enhance scene understanding in vision-language models. It bridges the gap between internet-scale image-text corpora and specific robotic visuomotor policies.
ViInfographicVQA is a Vietnamese Visual Question Answering benchmark for infographic understanding. It likely contains data-rich visuals mixing text, charts, maps, and design elements. The dataset was created by duytranus and was last updated on November 14, 2025.
A subset of 12 million image-text pairs from the DataComp-1B-BestPool collection, released by mlfoundations in 2024. The dataset is designed for training image-text models and is licensed under Creative Commons CC-BY-4.0, though individual images retain their original copyrights. It was introduced in the MobileCLIP paper and is reported to yield better model performance than several established benchmarks.
10,000 to 100,000 multimodal records for cold-start supervised fine-tuning (SFT) in reasoning tasks, released by WaltonFuture in 2025. It supports the research paper 'Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start' by providing initial training data for a two-stage reinforcement learning pipeline.