DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Openhermes 2.5 MIG 50K: High-Quality Instruction Tuning Data

Openhermes-2.5-MIG-50K is a dataset containing 50,000 high-quality and diverse samples for supervised fine-tuning. The data was selected from the Openhermes2.5 dataset using the MIG method, which automatically selects data by maximizing information gain in semantic space. Author xsample uploaded the dataset to Hugging Face on April 24, 2025.

Text, Language Model, Instruction Tuning, Data Selection, SFT Data, +1 more

Reason-RFT: Reinforcement Fine-Tuning Dataset for Visual Reasoning

A multimodal dataset used in the 'Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning' project. The dataset was created by author 'tanhuajie2001' and was last updated on the Hugging Face platform on April 18, 2025. Its description suggests it is intended to enhance embodied reasoning capabilities for systems like RoboBrain.

Multimodal, Multimodal AI, Robotics, Reinforcement Learning, Visual Reasoning, +1 more

BiomedParseData: Biomedical Image Segmentation and Detection Across Nine Modalities

A processed dataset for training a foundation model for joint segmentation, detection, and recognition of biomedical objects. Each instance includes a 1024x1024 PNG image, a list of textual descriptions for the segmentation target, and a corresponding 1024x1024 binary ground truth mask. The dataset is hosted by Microsoft and was last updated in April 2025.

Multimodal, Multimodal Foundation Model, Medical Vision, Biomedical Imaging, Computer Vision, +1 more
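
The BiomedParseData instance layout described above (a 1024x1024 image, a list of textual descriptions, and a 1024x1024 binary mask) can be sketched as a small sanity check to run after download. This is only an illustration: the class name, field names, and the assumption that the mask has already been decoded to 0/1 values are hypothetical and are not taken from the dataset's documentation.

```python
from dataclasses import dataclass
from typing import List, Tuple

SIDE = 1024  # side length stated in the listing for both image and mask


@dataclass
class SegmentationInstance:
    """Hypothetical in-memory view of one instance; field names are assumptions."""
    image_size: Tuple[int, int]  # (width, height) of the PNG image
    descriptions: List[str]      # textual descriptions of the segmentation target
    mask: List[List[int]]        # ground truth mask, assumed already decoded to 0/1


def check_instance(inst: SegmentationInstance) -> List[str]:
    """Return a list of problems; an empty list means the instance matches the listing."""
    problems = []
    if tuple(inst.image_size) != (SIDE, SIDE):
        problems.append(f"image is {inst.image_size}, expected {(SIDE, SIDE)}")
    if not inst.descriptions or not all(isinstance(d, str) and d for d in inst.descriptions):
        problems.append("descriptions must be a non-empty list of non-empty strings")
    if len(inst.mask) != SIDE or any(len(row) != SIDE for row in inst.mask):
        problems.append(f"mask must be {SIDE}x{SIDE}")
    elif any(v not in (0, 1) for row in inst.mask for v in row):
        problems.append("mask must be binary (0 or 1)")
    return problems


# Usage with synthetic data (not real dataset content)
mask = [[0] * SIDE for _ in range(SIDE)]
mask[10][20] = 1
example = SegmentationInstance((SIDE, SIDE), ["example target description"], mask)
print(check_instance(example))  # []
```

If the masks are stored as 0/255 PNGs, they would need to be thresholded before this check applies.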

Awsome Multi Modal Based Phm

A curated collection of multi-modal Prognostics and Health Management (PHM) resources organized into fault diagnosis and fault prediction categories. The content addresses the integration of diverse sensor data for industrial equipment health monitoring.

Benchmark, Paper, +1 more

JourneyBench Multi-Image Visual Question Answering Test Set

JourneyBench Multi_Image_VQA is a test-only dataset for debugging multimodal reasoning models. It contains visual question answering examples requiring analysis across multiple images. The dataset was created by author hiyouga and last updated in April 2025.

Image, Text, English, Image Understanding, Computer Vision, Multimodal Reasoning, Visual Question Answering, +1 more

KSL-LEX: Korean Sign Language Lexical Database

6,463 entries representing 6,289 unique headwords in Korean Sign Language (KSL), derived from the Korean Sign Language Dictionary's everyday signs collection. The dataset provides detailed linguistic annotations for each sign, expanding upon the original dictionary's 3,669 signs to offer a more granular lexical database.

CSV, Size Categories: 1K<n<10K, Library: polars, Lexical Database, Modality: text, License: gpl-3.0, Library: mlcroissant, Library: datasets, Library: pandas, Sign Language, Language: ko, Region: us, Korean Sign Language, Linguistics, +1 more

SurveillanceVQA: 589K Visual Question Answering Pairs

SurveillanceVQA 589K is a dataset for visual question answering tasks, likely containing image-question-answer pairs. The dataset was created by author fei213 and was last updated on Hugging Face on 2025-05-16 03:52:51. Its specific content, such as the source and nature of the surveillance imagery, requires verification after download.

Multimodal, Surveillance, Computer Vision, Region: us, QA Dataset, License: mit, Visual Question Answering, +1 more

OpenVLMRecords: Vision-Language Model Evaluation Results from VLMEvalKit

Evaluation records generated by VLMEvalKit, reflecting the OpenVLM Leaderboard. The dataset was last updated on 2025-04-08 06:23:26 and is maintained by the author VLMEval. It contains results from evaluating various Vision-Language Models (VLMs) on different benchmarks.

Tabular, Vision Language Models, Leaderboard, Model Evaluation, Benchmark, Benchmark Records, Synthetic, +1 more

Traffic VQA: Visual Question Answering for Traffic Scenes

Traffic VQA is a multimodal dataset for visual question answering tasks, likely containing images or videos of traffic scenes paired with textual questions. It was published on the Hugging Face platform by author YuYu2004 and was last updated on May 16, 2025. The specific content, scale, and annotation details require verification after download.

Multimodal, Multimodal AI, Computer Vision, Traffic Analysis, Visual Question Answering, +1 more

STCray: Multimodal X-Ray Baggage Security Dataset

A multimodal X-ray baggage security dataset introduced by Naoufel555 in 2025. It is described as the first of its kind, designed to address limitations in representing real-world, sophisticated threats and concealment tactics. The dataset aims to move beyond closed-set paradigms with predefined labels for computer-aided screening systems.

Multimodal, Security Threats, X-Ray, Computer Vision, Baggage Scan, +1 more

SPAR-Bench: 7,207 Spatial Reasoning QA Pairs Across 20 Tasks

SPAR-Bench contains 7,207 manually verified spatial reasoning question-answer pairs across 20 distinct tasks, released by jasonzhango in 2025. The benchmark evaluates vision-language models using single-view, multi-view, and video modalities to test spatial perception and reasoning capabilities.

Parquet, Size Categories: 1K<n<10K, Library: polars, Library: dask, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, arXiv:2503.22976, Region: us, +1 more

OmniCorpus-YT: 10 Million Image-Text Documents from YouTube Videos

OpenGVLab's OmniCorpus-YT is a large-scale multimodal dataset containing 10 million image-text interleaved documents collected from YouTube videos. The dataset is part of the broader OmniCorpus project, which encompasses billions of images, and was presented in an ICLR 2025 Spotlight paper. The repository was last updated on March 20, 2025.

Multimodal, Multimodal Corpus, Computer Vision, Large Scale, Natural Language Processing, Image Text Interleaved, +1 more

Chebi 20 Multimodal: Chemical Entity and Image Data

A multimodal dataset likely containing information related to chemical entities, as suggested by the title 'Chebi'. The dataset was published on Hugging Face by the author jablonkagroup and was last updated on 2025-05-11. The platform tags indicate it contains both image and text modalities.

Image, Multimodal, Parquet, Text, Library: polars, Library: dask, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Modality: image, Library: datasets, Bioinformatics, Chemistry, Region: us, +1 more

OmniCorpus-CC: 988 Million Image-Text Interleaved Documents

OmniCorpus-CC is part of a unified multimodal corpus on the scale of 10 billion images interleaved with text. It contains 988 million image-text interleaved documents collected from Common Crawl. The dataset was created by OpenGVLab and was last updated on the platform in March 2025.

Multimodal, Parquet, Library: polars, Task Categories: image-to-text, Library: dask, Language: en, Task Categories: visual-question-answering, Size Categories: 100M<n<1B, Web Crawl, Modality: text, arXiv:2406.08418, Library: mlcroissant, Vision Language, Image Text, Library: datasets, License: cc-by-4.0, Computer Vision, Region: us, Large Scale, Natural Language Processing, +1 more

M3Docvqa: Multimodal Document Visual Question Answering Dataset

M3Docvqa is a multimodal dataset published on Hugging Face by YeMoKoo on May 3, 2025. The dataset likely contains document images paired with questions and answers. Its specific size, format, and content require verification after download.

Multimodal, Vision Language, Question Answering, Document VQA, +1 more

MME-CoT: Benchmarking Chain-of-Thought Reasoning in Multimodal Models

MME-CoT is a benchmark dataset for evaluating Chain-of-Thought reasoning in Large Multimodal Models. It was created by author CaraJ and published on Hugging Face, with its last update recorded on 2025-03-19. The dataset focuses on assessing reasoning quality, robustness, and efficiency.

Multimodal, Chain of Thought, Benchmark, Reasoning Evaluation, Large Language Models, Multimodal Benchmark, +1 more

Multimodal Textbook of 6.5 Million Instructional Video Keyframes

6.5 million keyframe images are interleaved with 0.8 billion words of ASR text from instructional videos, forming a corpus for vision-language pretraining. The dataset was created by DAMO-NLP-SG for the research project '2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining' and was last updated in March 2025.

Task Categories: text-generation, Size Categories: 1M<n<10M, Language: en, Task Categories: summarization, Interleaved, Pretraining, Region: us, Reasoning, arXiv:2501.00958, License: apache-2.0, +1 more

Omni Med Vqa Mini: A Medical Visual Question Answering Dataset

The 'Omni Med Vqa Mini' dataset was published on the Hugging Face platform by author 'simwit' and last updated on 2025-04-24 17:24:49. Its title suggests it contains medical images paired with questions and answers. The specific content, size, and structure require verification after download.

Multimodal, Medical Imaging, Multimodal QA, Vision Language, Medical VQA, +1 more

BigDocs-Bench: Benchmark for Multimodal Document and Code Tasks

BigDocs-Bench is a benchmark suite introduced by ServiceNow for evaluating multimodal models on tasks that transform visual inputs into structured outputs. The dataset is associated with the paper 'BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks'. The benchmark data was initially released on 2024-12-10 and last updated on the platform on 2025-03-19.

Multimodal, Parquet, GUI Intent, Document Understanding, Library: polars, Library: dask, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Modality: image, Library: datasets, Benchmark, License: cc-by-4.0, Computer Vision, arXiv:2412.04626, Region: us, Structured Output, Multimodal Benchmark, +1 more

Babillage: Multimodal Spoken Dialogue Benchmarks for Vision Speech Models

Babillage is a multimodal benchmark dataset introduced alongside MoshiVis. It contains three common vision-language benchmarks—COCO-Captions, OCR-VQA, and VQAv2—converted into spoken form for evaluating Vision Speech Models. The dataset was created by kyutai and last updated on March 21, 2025.

Audio, Multimodal, Benchmark, Computer Vision, Spoken Dialogue, Synthetic Speech, Vision Speech Models, Multimodal Benchmark, Synthetic, +1 more
