DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,944 datasets

Encyclopedic-VQA: Visual Questions About Fine-Grained Category Properties

Encyclopedic-VQA is a visual question answering dataset converted to a unified Parquet schema. The dataset, originally from Google and presented at ICCV 2023 by Mensink et al., contains questions about detailed properties of fine-grained categories. The data is hosted on Hugging Face by the author reonokiy and was last updated on April 1, 2026.

Multimodal, Multimodal QA, Computer Vision, Fine Grained Classification, Encyclopedic Knowledge, Visual Question Answering, +1

LLaVA-LoRA-Sidewalk: Multimodal AI Dataset for Sidewalk Scenes

LLaVA-LoRA-Sidewalk is a dataset hosted on Kaggle. The title suggests it contains multimodal data, likely images and text, related to sidewalk environments. Its specific content, scale, and origin require verification after download.

Multimodal, LLaVA, Multimodal AI, Computer Vision, Sidewalk, +1

Multi-Sensor Satellite Image and Text Dataset for Earth Observation

464,044 co-registered image-text pairs from Sentinel-1 and Sentinel-2 satellites form this large-scale dataset. It was created by BIFOLD-BigEarthNetv2-0 to advance vision-language learning for remote sensing data. The dataset was last updated on the platform in April 2026.

Image, Geospatial, Multimodal, Task Categories: image-text-to-text, Task Categories: multiple-choice, arXiv: 2603.29630, Size Categories: 1M-10M, Language: en, Task Categories: visual-question-answering, Task IDs: multiple-choice-qa, Modality: text, Modality: tabular, Multispectral, Vision Language, Multi Sensor, Benchmark, Task IDs: image-captioning, Computer Vision, Sentinel 1, Earth Observation, Region: us, License: cdla-permissive-1.0, Large Scale, Sentinel 2, +1

SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning

SALMUBench is the official evaluation dataset for a CVPR 2026 benchmark on multimodal unlearning. The dataset, authored by cvc-mmu, is designed to assess methods for removing sensitive associations from models. It was last updated on March 30, 2026.

Multimodal, Benchmark Evaluation, AI Safety, Benchmark, Computer Vision, Multimodal Unlearning, +1

AwaRes: Vision-Language Model Training Data for Efficient High-Resolution Crop Retrieval

Hugging Face hosts the AwaRes training dataset, created by NimrodShabtay1986 and last updated on March 26, 2026. This multimodal dataset supports a spatial-on-demand VLM inference framework designed to process low-resolution images and selectively retrieve high-resolution crops. The associated paper and project page detail the framework's performance benchmarks and efficiency gains.

Multimodal, Parquet, Size Categories: 10K-100K, Library: polars, Task Categories: image-to-text, Library: dask, Language: en, Task Categories: visual-question-answering, VLM Training, Tool Calling, High Resolution, Modality: text, Library: mlcroissant, Vision Language, Modality: image, Library: datasets, GRPO, Region: us, arXiv: 2603.16932, VLM, Spatial Awareness, License: apache-2.0, +1

BanglaMedVQA: Bengali Medical Visual Question Answering Training Data

BanglaMedVQA is a dataset for fine-tuning the MedGemma-4B vision-language model on Bengali medical question answering. The repository contains training and testing configurations for models such as Qwen2.5-VL-7B and MedGemma-4B. It was created by iiCEMAN and last updated on April 8, 2026.

Multimodal, Vision Language Model, Bengali Language, Healthcare, Computer Vision, Fine Tuning, Medical VQA, +1

MultiNativQA: Multilingual Culturally-Aligned Questions for LLMs

MultiNativQA is a multilingual question-answering resource spanning 7 languages, including high- to extremely low-resource ones. It covers 9 locations/cities and includes dialect variations for languages like Arabic. The dataset was created by QCRI and was last updated on March 31, 2026.

Text, Multilingual, Multilingual QA, LLM Evaluation, Cultural Alignment, Low Resource Languages, Dialect Variation, +1

HHRLHF: Human Preference Data for Reinforcement Learning from Human Feedback

This dataset is published on Kaggle under the title 'HHRLHF Dataset'. It likely contains text-based examples of human preferences or feedback, intended for training or fine-tuning language models. Its specific content, size, and origin require verification after download.

Text, Preference Data, Reinforcement Learning, LLM Training, Human Feedback, +1

MMOU: Benchmark for Multimodal Reasoning on Long, Complex Videos

MMOU is a benchmark for evaluating multimodal models on joint audio-visual understanding and reasoning in long and complex real-world videos. The dataset was created by NVIDIA and last updated on March 28, 2026. It is designed to test models on video, speech, sound, music, and long-range temporal context.

Audio, Time Series, Video, Multimodal, JSON, Source Datasets: original, Size Categories: 10K-100K, Library: polars, arXiv: 2603.14145, Language: en, AI Evaluation, Long Video, Audio Visual Reasoning, Modality: text, Library: mlcroissant, Audio Visual, Task Categories: video-text-to-text, Library: datasets, Benchmark, Library: pandas, Modality: video, Video Understanding, Region: us, Large Scale, License: apache-2.0, Multimodal Benchmark, Annotations Creators: expert-generated, +1

HalluBench: Geospatial Benchmark for Vision-Language Models

HalluBench is a benchmark dataset for evaluating hallucination in vision-language models on geospatial imagery. It was created by AuwAuwAuw and last updated on April 5, 2026. The dataset covers two application domains: emergency disaster assessment and urban scene understanding.

Geospatial, Multimodal, Vision Language Models, Satellite Imagery, Benchmark, Disaster Assessment, Computer Vision, Urban Scene, Geospatial Benchmark, +1

Rlhf Learn: Reinforcement Learning Algorithms for Policy Training

Rlhf Learn provides resources for improving the stability and efficiency of reinforcement learning. It focuses on policy-optimization algorithms such as TRPO, PPO, DPO, GRPO, DAPO, and GSPO. The repository was authored by Dylsimple60 and last updated on May 19, 2026.

Tabular, Machine Learning, RLHF, Policy Optimization, Reinforcement Learning, +1

Negation-Aware Grounding Annotations for Flickr30k Images

CoVAND provides annotations for a negation-aware visual grounding dataset built upon the Flickr30k corpus. The dataset was created by author 2na-97 to support the ICLR 2026 paper on negation-aware vision-language models. It was last updated in April 2026.

Multimodal, Vision Language, Structured Reasoning, Image Captioning, Negation Grounding, +1

Camp Fire 2018: Multimodal Data Collection

This is a multimodal dataset related to the 2018 Camp Fire event. It is hosted on Kaggle, but its specific contents, size, and origin are not detailed in the available metadata. Inspection after download is required to confirm the data types, volume, and collection methodology.

Geospatial, Multimodal, Wildfire, Disaster Response, +1

CT-RATE: 10,000+ Multimodal 3D Chest CT Scans and Radiology Reports

CT-RATE consists of 10,000 to 100,000 3D chest CT scans paired with corresponding radiology reports, released by Ibrahim Hamamci in 2024. This multimodal dataset facilitates the development of 3D medical foundation models through vision-language alignment. It supports diverse tasks including visual question answering, image-to-text generation, and zero-shot classification.

Multimodal, Size Categories: 10K-100K, Huggingscience, Task Categories: question-answering, arXiv: 2403.17834, Task Categories: image-to-text, Language: en, Task Categories: visual-question-answering, License: cc-by-nc-sa-4.0, Task Categories: text-to-image, Chest CT, Vision Language, Task Categories: image-classification, Task Categories: zero-shot-classification, Region: us, CT Rate, 3d-medical-imaging, Science, Medical, +1

Moellava-Package: Multimodal AI Model Components

Moellava-package is a dataset hosted on Kaggle. The title suggests it likely contains components or data related to a multimodal large language model. Metadata is minimal; actual content requires verification after download.

Multimodal, Vision Language, Multimodal AI, Large Language Model, +1

Kvsair VQA: Visual Question Answering Data with Class Labels

This is a dataset for Visual Question Answering (VQA) tasks, likely containing images paired with questions and answers. The title suggests the data may be organized by specific classes or categories. It is published on Kaggle, but the original author, collection date, and dataset size are unknown.

Multimodal, Image Text Pairs, Computer Vision, Natural Language Processing, Visual Question Answering, +1

AI Voices Deduplicated: 2,004 High-Quality Speaker-Deduplicated Audio Samples

This dataset contains 2,004 high-quality AI voice samples derived from a larger collection of approximately 32,000 samples. It was created by LAION through quality filtering and speaker deduplication using speaker embeddings and clustering. It was last updated on March 17, 2026.

Audio, Voice Cloning, Audio Samples, Speaker Deduplication, AI Voices, +1

Hierarchical Visual Question Answering for Emotion and Cognition

InsightVQA is a large-scale benchmark for hierarchical visual question answering that connects emotion understanding with cognitive reasoning. The dataset, created by ziyul707 and last updated in April 2026, is designed to evaluate model capabilities in interpreting emotional causes, grounding evidence, and performing reasoning.

Multimodal, Benchmark, Emotion Understanding, Large Scale, Cognitive Reasoning, Multimodal Benchmark, Visual Question Answering, +1

VLM Voice Commands: 50,000 Natural Language Instructions for Robot Control

VLM Voice Commands is a text dataset of 50,000 curated natural language commands for Vision-Language-Model robot control. The dataset, created by cagataydev and last updated on March 22, 2026, covers 10 categories of embodied human-robot interaction.

Text, Audio, Vision Language Models, Benchmark, Robotics, Computer Vision, Natural Language Processing, Embodied AI, Voice Commands, +1

Ubuntu OSWorld Verified Trajectories: 100K+ Multimodal Agent Paths

OSWorld-Verified Model Trajectories contains between 100,000 and 1,000,000 evaluation records of multimodal AI agents performing tasks in real computer environments. Created by xlangai and updated in March 2026, the data captures verified execution paths and screenshots from state-of-the-art models tested on the OSWorld benchmark.

Size Categories: 100K-1M, Code, Region: us, License: mit, +1