DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

MER2024: Multimodal Emotion Recognition Challenge Dataset

MER2024 is a large-scale multimodal dataset released for the MER24 Challenge at IJCAI. It builds upon the MER23 and MRAC23 datasets from ACM Multimedia, expanding data volume and task diversity. The dataset aims to advance robust and practical multimodal emotion recognition.

Tags: Multimodal, Challenge Dataset, Language: en, Multimodal Emotion Recognition, Affective Computing, License: CC BY-NC 4.0, Region: US, Large Scale, Human Behavior, +1

Vision-Language Instruction Data for 3R Tasks

A multimodal dataset for vision-language model training, hosted on HuggingFace by author Journey9ni. The dataset was last updated in June 2025 and is categorized as containing between 100,000 and 1,000,000 entries. It is designed for tasks involving the 3R framework.

Tags: Multimodal, JSON, Library: dask, Vision Language Model, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Library: datasets, Multimodal Training, Region: US, Instruction Tuning, License: Apache 2.0, +1

Describe Anything: 100K-1M Localized Image and Video Captions

NVIDIA, UC Berkeley, and UCSF released this collection of 100,000 to 1,000,000 records in 2025 for training Describe Anything Models (DAM). The data consists of localized image and video captions stored in WebDataset tar files to support vision-language tasks.

Tags: Image, Video, WebDataset, Task Categories: image-to-text, Language: en, Library: webdataset, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Modality: image, Task Categories: video-text-to-text, Library: datasets, Modality: video, Region: US, arXiv: 2504.16072, +1

S3E: A Multi-Robot Multimodal Dataset for Collaborative SLAM

S3E is a multimodal dataset for collaborative Simultaneous Localization and Mapping (SLAM) created by PengYu-Team. The dataset was last updated on May 15, 2025. It is designed for multi-robot systems and includes experimental sequences captured in a laboratory environment.

Tags: Multimodal, Collaborative Robotics, Benchmark, Multi-Robot SLAM, Robotics Dataset, Multimodal Sensor Data, +1

PLM-Image Auto: Synthetic Image Captions and Question-Answer Pairs

Synthetic annotations for images and documents created by Facebook for the PLM model. The dataset includes generated captions for images from SA1B, OpenImages, and Object365, and question-answer pairs for documents from ArXivQA, UCSF, and PDFAcc. The dataset was last updated on April 21, 2025.

Tags: Multimodal, Image Captions, Multimodal QA, Computer Vision, LLM Training, Synthetic Annotations, +1

DLC-Bench: Detailed Localized Captioning Benchmark for Images and Videos

DLC-Bench is a dataset for benchmarking detailed and localized image and video captioning. It was created by researchers from NVIDIA, UC Berkeley, and UCSF, including Long Lian, Yifan Ding, and others. The dataset was last updated on the Hugging Face platform on April 24, 2025.

Tags: Image, Multimodal, ImageFolder, Image To Text, Task Categories: image-to-text, Language: en, Size Categories: n<1K, Library: mlcroissant, Vision Language, Modality: image, Library: datasets, Benchmark, Computer Vision, Image Captioning, Region: US, arXiv: 2504.16072, Detailed Captioning, Localized Captioning, Multimodal Benchmark, +1

PLM-Video Auto: Synthetic Video Captions and Multiple-Choice Questions

Synthetic annotations for video understanding tasks, covering the YT-1B and Ego4d datasets. The dataset includes video captions and multiple-choice question-answer pairs, as described in the associated technical report. It was created by Facebook and last updated on the Hugging Face platform in April 2025.

Tags: Multimodal, Video Understanding, Multiple Choice QA, Synthetic Annotations, Video Captions, +1

Mpdocvqa Corpus: A Multimodal Visual Question Answering Dataset

Mpdocvqa Corpus is a multimodal dataset published on HuggingFace by author AHS-uni. The dataset was last updated on June 8, 2025. Its specific content and scale are unknown from the provided metadata.

Tags: Text, Multimodal, Natural Language Processing, Visual Question Answering, +1

Uncensor V1 DPO: Uncensored Direct Preference Optimization Dataset

This DPO dataset contains pairs of harmful prompts and model responses derived from the LLM-LAT/harmful-dataset. It reconfigures the preference structure by labeling standard model refusals as 'rejected' and the original harmful or incorrect answers as 'chosen'.

Tags: JSON, Size Categories: 1K<n<10K, Library: polars, Modality: text, Library: mlcroissant, Library: datasets, Library: pandas, Region: US, +1

FunBench: Benchmarking Multimodal LLMs on Fundus Image Reading

FunBench is a novel visual question answering benchmark designed to evaluate multimodal large language models' fundus reading skills. The dataset was created by AIMClab-RUC and last updated on May 14, 2025. Code and a description are available on a linked GitHub repository.

Tags: Multimodal, Medical Vision, Multimodal LLM, Benchmark, Visual Question Answering, +1

PathGen-1.6M: Pathology Image-text Pairs Generated via Multi-agent Collaboration

PathGen-1.6M is a dataset of 1.6 million pathology image-text pairs, last updated in April 2025. It was created by jamessyx and is intended for training Vision Language Models (VLMs) like CLIP. The dataset is designed to support applications in pathology, such as zero-shot image classification and Whole Slide Image analysis.

Tags: Video, Multimodal, JSON, Image Text Pairs, Library: polars, Size Categories: 1M<n<10M, Medical Imaging, Vision Language Models, Modality: text, Library: mlcroissant, Library: datasets, Library: pandas, License: CC BY 4.0, Computer Vision, arXiv: 2407.00203, Region: US, Pathology, Large Scale, +1

HistBench: A Multimodal Benchmark for Historical Reasoning

HistBench is a benchmark dataset introduced in the paper 'On Path to Multimodal Historical Reasoning: HistBench and HistAgent'. The dataset is hosted on HuggingFace by the author jiahaoq and was last updated on May 27, 2025. Its specific size and structure are not detailed in the provided metadata.

Tags: Multimodal, AI Benchmark, LLM Evaluation, Multimodal Benchmark, Historical Reasoning, +1

Open-Qwen2VL Data: Filtered Image-Text Pairs for Multimodal LLM Pre-Training

A collection of filtered image-text pairs from academic resources, used for pre-training the Open-Qwen2VL multimodal large language model. The dataset includes subsets like ccs_webdataset, derived from CC3M-CC12M-SBU and filtered by CLIP, and datacomp_medium_dfn_webdataset. It was created by weizhiwang and last updated on April 16, 2025.

Tags: Multimodal, Vision Language, Multimodal LLM, Pre-Training Data, Academic Resources, +1

Magma: A Foundation Model for Multimodal AI Agents

Magma is a foundation model for multimodal AI agents developed by researchers from Microsoft Research, University of Maryland, University of Wisconsin-Madison, KAIST, and University of Washington. The dataset, last updated on April 12, 2025, is associated with a project page, arXiv paper, and GitHub repository. It likely contains multimodal data for training and evaluating AI agents capable of processing and reasoning across different data types.

Tags: Multimodal, Foundation Model, Video Reasoning, Multimodal AI, AI Agents, +1

MM-RLHF: A Multimodal LLM Alignment Dataset and Reward Model

MM-RLHF is a project for aligning Multimodal Large Language Models with human preferences. The release includes a high-quality alignment dataset and a strong critique-based reward model. The project was open-sourced by yifanzhang114 in February 2025.

Tags: Multimodal, RLHF, Alignment, Multimodal LLM, Benchmark, Human Feedback, Reward Model, +1

Multimodal Image-Caption Pairs with Synthetic Data Enrichment

FUSION-10M is a large-scale dataset of image-caption pairs designed for pretraining multimodal AI models. It builds upon established datasets like LLaVA, ShareGPT4, and PixelProse and includes 2 million synthesized task-specific pairs. The dataset was created by author starriver030515 and was last updated in April 2025.

Tags: Parquet, Library: polars, Language: zh, Task Categories: question-answering, Language: en, Task Categories: visual-question-answering, Size Categories: n<1K, Modality: text, Library: mlcroissant, Task Categories: table-question-answering, Modality: image, Library: datasets, Library: pandas, arXiv: 2504.09925, Region: US, License: Apache 2.0, +1

SpaceThinker: Synthetic Spatial Reasoning Traces for Vision-Language Models

SpaceThinker is a dataset created by remyxai and last updated on April 23, 2025. It is designed for training LLaVA-style Vision-Language Models and contains synthesized spatial reasoning traces. The data was generated using VQASynth from a subset of images in the Localized Narratives split of The Cauldron.

Tags: Multimodal, Spatial Reasoning, Vision Language, VQA, Synthetic Data, +1

Shallow Vs Deep Safety Alignment: Derivatives of HEx-PHI for LLM Tuning

A text dataset derived from the LLM-Tuning-Safety/HEx-PHI dataset, intended for research on large language model safety alignment. The dataset was created by Unispac and last updated on April 23, 2025. Its specific content and scale are not detailed in the provided metadata.

Tags: Text, Safety Tuning, Alignment, Text Generation, LLM Safety, +1

LiveBench Instruction Following Benchmark

LiveBench is a benchmark for large language models designed to limit test set contamination by releasing new questions monthly. Questions are based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses. It was created by 'livebench' and last updated in April 2025.

Tags: Text, Parquet, LLM Benchmark, Library: polars, Text Generation, Size Categories: n<1K, Modality: text, Library: mlcroissant, Library: datasets, Benchmark, Library: pandas, Region: US, arXiv: 2406.19314, Evaluation Metrics, +1

SWE-bench Multimodal: 617 Real-World GitHub Issue Resolution Tasks

SWE-bench Multimodal provides 617 task instances for evaluating AI systems on real-world software engineering problems. The dataset, created by SWE-bench, was last updated on April 29, 2025. It is designed to test the ability of language models to resolve actual GitHub issues.

Tags: Multimodal, GitHub Issues, AI Evaluation, Software Engineering, Benchmark, Multimodal Benchmark, +1
