Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
M2KR-Challenge is a multimodal retrieval dataset created by Jingbiao and last updated on February 4, 2025. It contains 6.42k query samples with images and optional text, and a collection of 47.3k textual passages with associated web screenshots. The dataset is designed for image-to-document and image+text-to-document matching tasks.
This multimodal agent benchmark evaluates AI performance within simulated clinical environments using language agents. It adapts the MedQA dataset to facilitate interactive diagnostic reasoning between AI doctors and simulated patients across various medical scenarios.
A synthetic preference dataset for instruction tuning, developed by the LLM-jp collaborative project in Japan. It is aimed at ensuring the safety and appropriateness of large language model outputs in Japanese. The dataset was last updated on February 2, 2025.
A gold-standard benchmark for document alignment between Sinhala, Tamil, and English. It contains manually annotated document pairs crawled from four Sri Lankan news websites: Army, Hiru, ITN, and Newsfirst.
MathV360K is a multimodal dataset containing 360,000 question-answer pairs and 40,000 images sourced from 24 datasets. It was created by Zhiqiang007 and uploaded to Hugging Face on 2024-06-27 to enhance the mathematical reasoning capabilities of multimodal large language models.
A formatted evaluation suite for large multi-modality models (LMMs), created by lmms-lab and last updated on March 8, 2024. It is designed to accelerate LMM development by enabling one-click evaluations through the lmms-eval pipeline. The dataset is based on the MM-Vet benchmark described in the associated research paper.
1-hour videos and v1.0 development set annotations for long-form video-language understanding. This benchmark from Stanford University was introduced at NeurIPS 2024 to evaluate models on extended temporal sequences.
A gold-standard benchmark dataset for sentence alignment between Sinhala, English, and Tamil. The data was crawled from news websites including Army, Hiru, ITN, and Newsfirst, with aligned sentences derived from a prior document alignment dataset.
A demonstration dataset for Vision-Language-Action model finetuning, collected via teleoperation of a robot arm performing a basic pick-and-place task. The data was gathered by author IliaLarchenko using a modified LeKiwi setup with three cameras and was last updated in September 2025.
A dataset accompanying the research paper "Instruction Tuning with GPT-4". It was created by the team llm-wizard and last updated on April 7, 2023. It is licensed for non-commercial research use under CC BY-NC 4.0.
SWE-bench Multimodal provides 617 task instances for evaluating AI systems on real-world software engineering problems. The dataset, created by SWE-bench, was last updated on April 29, 2025. It is designed to test the ability of language models to resolve actual GitHub issues.
Dense English captions for the CommonCatalog CC-BY image collection generated via the Phi-3 Vision model. The data is structured in a CSV format where each entry is linked to the original image repository through a unique photoid primary key.
HQD4VLM is a dataset curated for vision-language model research. The dataset likely contains filtered samples intended to reduce noise and improve training efficiency. It was created by author Nhanvi282 and last updated on January 11, 2025.
LLaVAR provides a collection of 422,000 pretraining and 16,000 to 20,000 instruction-following data pairs for training multimodal AI models. Created by SALT-NLP, this dataset enhances visual instruction tuning by focusing on images containing text. The dataset was released and last updated in July 2023.
A dataset created by M4U-Benchmark for evaluating multilingual understanding and reasoning in large multimodal models. It was made publicly available on May 23, 2024, and is hosted on Hugging Face. It likely contains paired text and image data designed to test AI models across multiple languages.
COYO-700M is a large-scale dataset containing 747 million image-text pairs with additional meta-attributes. It was created by KakaoBrain using a strategy of collecting informative alt-text and associated images from HTML documents. The dataset was last updated on August 30, 2022.
ImageCoDe is a vision-and-language benchmark requiring contextual understanding of pragmatics, temporality, long descriptions, and visual nuances. The dataset was created by BennoKrojer and last updated on May 13, 2022. The specific row count, column count, and dataset size are unknown.
Tasksource provides the OASST1 dataset preprocessed for reward modeling. It contains pairwise human feedback data for training reinforcement learning from human feedback (RLHF) reward models, focusing on conversational AI and multilingual text.
An export of all XKCD comics, including their transcripts and explanations scraped from explainxkcd.com. It includes fields such as comic title, image URL, transcript, and explanation URL.
S3E is a multimodal dataset for collaborative Simultaneous Localization and Mapping (SLAM) created by PengYu-Team. The dataset was last updated on May 15, 2025. It is designed for multi-robot systems and includes experimental sequences captured in a laboratory environment.