Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

SlideVQA: Multi-Image Document Question Answering on 10K+ Slide Decks

SlideVQA is a document visual question answering dataset containing between 10,000 and 100,000 records, released by NTT-hil-insight in 2023. It focuses on multi-image reasoning where models must select specific evidence slides from a deck to answer natural language questions.

Format: Parquet | Size: 10K<n<100K | Libraries: polars, dask, mlcroissant, datasets | Tasks: question answering, visual question answering | Language: en | Modalities: text, image | arXiv: 2301.04883 | Region: us | +1 more

COIG-P: Chinese Preference Dataset for Human Value Alignment

A Chinese preference dataset developed for alignment with human values, as described in the associated research paper. The dataset was created by m-a-p and was last updated on Hugging Face on 2025-04-15. Its scale and content are detailed in the paper 'COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values'.

Tags: Text, Alignment, Preference Data, Chinese Language, Large Scale | +1 more

OmniCorpus CC 210M: 210 Million Image-Text Interleaved Documents

OpenGVLab's OmniCorpus CC 210M dataset contains 210 million image-text interleaved documents filtered from the Common Crawl web corpus. The dataset is designed for large-scale vision-language model training, as described in an ICLR 2025 spotlight paper. It was last updated on the Hugging Face platform in March 2025.

Tags: Multimodal, Vision Language, Image Text, Common Crawl, Computer Vision, Large Scale, Natural Language Processing | Format: Parquet | Size: 100M<n<1B | Libraries: polars, dask, mlcroissant, datasets | Tasks: image-to-text, visual question answering | Language: en | Modality: text | arXiv: 2406.08418 | License: cc-by-4.0 | Region: us | +1 more

Video R1 Eval: Evaluation Dataset for Reinforcing Video Reasoning in MLLMs

Evaluation benchmarks for the Video-R1 model across video reasoning categories, including test sets for temporal and causal logic. The dataset provides the data required to replicate the reasoning performance results presented in the 'Video-R1: Reinforcing Video Reasoning in MLLMs' research paper. It is designed to test the logical and temporal inference capabilities of Multimodal Large Language Models.

arXiv: 2503.21776 | License: apache-2.0 | Region: us | +1 more

WebDev Arena Preference 10K: LLM Battles for Web Development Tasks

10,000 real-world WebDev Arena battles involving 10 state-of-the-art large language models (LLMs). The dataset was created by lmarena-ai and was last updated on March 10, 2025. It is hosted on the Hugging Face platform.

Tags: Text, AI Benchmarking, LLM Evaluation, Preference Data, Web Development | +1 more

VisRAG-Ret-Test-ArxivQA: Visual Question Answering Dataset from arXiv Figures

A dataset for visual question answering based on figures extracted from arXiv publications. It originates from the ArXiVQA dataset within the Multimodal ArXiv collection. The dataset was created by openbmb and was last updated on March 15, 2025.

Tags: Multimodal, Multimodal AI, Natural Language Processing, arXiv Figures, Academic Publications, Visual Question Answering | +1 more

Mkgformer: Multimodal Knowledge Graph Data for SIGIR 2022 Benchmarks

Multimodal knowledge graph completion data featuring text and image modalities for link prediction and relation extraction. Released by zjunlp for the SIGIR 2022 conference, it supports the training of hybrid transformer models for knowledge graph enrichment.

Tags: Multimodal, Link Prediction, Named Entity Recognition, MKGformer, Relation Extraction, Knowledge Graph, PyTorch, MNRE, MKG, SIGIR2022, Transformer, KgFormer, KGC | +1 more

Visual Prompt Injection Benchmark for Cybersecurity

CyberSecEval 3 Visual Prompt Injection is a multimodal benchmark from Meta for evaluating cybersecurity risks in LLMs. It contains text and image inputs designed to test visual prompt injection vulnerabilities. The dataset is part of a larger security benchmark suite and was last updated in March 2025.

Tags: Prompt Injection, AI Security | Format: JSON | Size: 1K<n<10K | Libraries: polars, mlcroissant, datasets, pandas | Task: text generation | Language: en | Modalities: text, image | arXiv: 2408.01605, 2311.17600 | License: mit | Region: us | +1 more

HH RLHF Safety V3 DPO: Human Preference Data for LLM Safety Tuning

This dataset is derived from the original Anthropic/hh-rlhf collection and has been formatted using the OpenAI chat convention for Direct Preference Optimization (DPO) fine-tuning. Each conversational response has been labeled for safety using the Llama Guard model. The dataset was uploaded by javirandor and last updated on March 28, 2025.

Tags: Text, Chat Conversations, Safety, LLM Training, DPO, Human Feedback | Format: Parquet | Size: 1K<n<10K | Libraries: polars, mlcroissant, datasets, pandas | Modality: text | License: mit | Region: us | +1 more

DaTikZv3: TikZ Drawings Aligned with Captions

TikZ drawings and natural language captions are paired to facilitate the automated generation of LaTeX-based diagrams. This public version excludes certain drawings due to licensing but provides tools for full dataset recreation via the DaTikZ repository.

Format: Parquet | Size: 100K<n<1M | Libraries: polars, dask, mlcroissant, datasets | Modalities: text, image | License: other | Region: us | +1 more

ShowUI-desktop-8K: 8,000 PC Screenshots with GPT-4o Augmented Annotations

ShowUI-desktop-8K consists of approximately 8,000 PC-based UI grounding records featuring screenshots and annotations originally sourced from OmniAct. Created by showlab and updated in March 2025, the dataset provides visual and textual data for desktop interface interaction research. It utilizes GPT-4o to augment original labels with detailed attributes regarding appearance and functionality.

Format: Parquet | Size: 1K<n<10K | Libraries: polars, dask, mlcroissant, datasets | Modalities: timeseries, text, image | arXiv: 2411.17465 | Region: us | +1 more

R1 Vision Reasoning Instructions: A Vision-Language Instruction Dataset

A dataset for vision reasoning instruction tuning, released in 2025. The data is authored by Di Zhang and was last updated on March 6, 2025. It appears to be derived from the LLaVA-CoT-100k dataset, with images and raw data hosted on separate Hugging Face repositories.

Tags: Multimodal, Vision Language, Multimodal AI, Computer Vision, Reasoning | +1 more

ShareGPT4Video: 4.8 Million GPT-4V Generated Video Captions

ShareGPT4Video provides 4.8 million multi-modal video captions generated via GPT-4-Vision to improve modality alignment in Large Video-Language Models. Developed by the ShareGPT4Video team in 2024, the collection includes a specific 40,000-record subset for fine-grained visual perception tasks.

Format: JSON | Size: 10K<n<100K | Libraries: polars, mlcroissant, datasets, pandas | Tasks: question answering, visual question answering | Language: en | Modalities: text, image, video | arXiv: 2406.04325 | License: cc-by-nc-4.0 | Region: us | +1 more

ALM-Bench: A Multimodal Benchmark for Culturally Diverse and Low-Resource Languages

A benchmark for evaluating Large Multimodal Models (LMMs) on cultural context, local sensitivities, and low-resource language support, integrating visual cues. The dataset was created by MBZUAI and was last updated on February 28, 2025. It is associated with a CVPR 2025 publication.

Tags: Multimodal, Language Evaluation, Cultural Context, Benchmark, Computer Vision, Multimodal Benchmark | +1 more

Takara Image Captions: 1 Million Curated Multimodal Pairs

Over 1 million curated image-caption pairs were released by the Frontier Research Team at takara.ai in February 2025. The collection was produced by consolidating and standardizing multiple open-source datasets through a 96-hour computational validation process across three nodes.

Tags: Synthetic | Format: Parquet | Size: 1M<n<10M | Libraries: polars, dask, mlcroissant, datasets | Tasks: image-to-text, text-to-image | Language: en | Modalities: text, image | Region: us | +1 more

OmniAlign-V: 205k Samples for Multimodal LLM Human Preference Alignment

205k high-quality samples for aligning Multimodal Large Language Models with human preferences. The dataset was created by PhoenixZ and is associated with the paper 'OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference'. It was last updated on March 1, 2025.

Tags: Multimodal, Vision Language, Human Alignment, Multimodal LLM, AI Training | +1 more

OmniAlign-V-DPO: 150k Preference Pairs for Multimodal LLM Alignment

The OmniAlign-V-DPO dataset contains 150,000 high-quality positive-negative pairs for Direct Preference Optimization (DPO). It is based on the OmniAlign-V dataset and was created by PhoenixZ. The dataset was last updated on March 1, 2025.

Tags: Multimodal, Alignment, Vision Language, Multimodal LLM, Human Preferences, DPO | +1 more

M4U: Multilingual Multimodal Understanding Benchmark for AI Models

M4U is a benchmark dataset, published by M4U-Benchmark, for evaluating multilingual understanding and reasoning in large multimodal models. The dataset was made publicly available on May 23, 2024, and is hosted on Hugging Face. It likely contains paired text and image data designed to test AI models across multiple languages.

Tags: Multimodal, Multilingual, Engineering, AI Evaluation, Biology, Benchmark, Chemistry, Science, Visual Question Answering, Medical | Format: Parquet | Size: 1K<n<10K | Libraries: polars, mlcroissant, datasets, pandas | Task: visual question answering | Languages: zh, en, de | Modalities: text, image | arXiv: 2405.15638 | License: mit | Region: us | +1 more

DetailCaps-4870: Benchmark for Detail Image Caption Evaluation

DetailCaps-4870 is an evaluation benchmark for detail image captioning proposed in the paper 'Benchmarking and Improving Detail Image Caption'. It contains 4,870 images curated from various datasets, accompanied by ground truth detail captions generated by GPT-4V, Gemini-1.5-Pro, and GPT-4o. The dataset also includes captions generated by three open-source large vision-language models: LLaVA-1.5, CogVLM, and ShareCaptioner.

Tags: Multimodal, LLM Benchmark, Vision Language, Benchmark, Computer Vision, Image Captioning, Multimodal Evaluation, Synthetic | +1 more

DataComp 12M Image-Text Pair Identifiers

12 million unique identifiers (UIDs) reference a filtered subset of the larger DataComp-1B-BestPool dataset. Apple created this collection to train image-text models that outperform those trained on established datasets such as CC-12M and YFCC-15M. The dataset card was last updated in February 2025.

Tags: Multimodal, Image Text Pairs, Contrastive Learning, Multimodal Training, Computer Vision, CLIP Models | +1 more
