DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,939 datasets

MotIF-1K: Multimodal Human and Robot Motion Trajectories with Task Annotations

MotIF-1K pairs 1,000 multimodal trajectories of human and Stretch-robot motion with task and motion annotations. The dataset was released by authors from MIT and Stanford with the paper 'MotIF: Motion Instruction Fine-tuning' in 2024. It is hosted on Hugging Face by the user 'myconnects'.

Tags: Multimodal, Multimodal Robotics, Motion Trajectories, Instruction Following, +1 more

SONDER Mini Portfolio 0001: A Multimodal AI Training Dataset

SONDER Mini Portfolio 0001 is a multimodal AI training dataset released on 2026-04-21 by Chanda Mandisa Lowrance, PhD. It is available for purchase at a listed price of $500.00.

Tags: Multimodal, Training Data, Multimodal AI, Portfolio Dataset, +1 more

Multimodal Behavioral Authentication Data

A dataset concerning multimodal behavioral authentication, sourced from Kaggle. The dataset's specific content, scale, and collection methodology are not detailed in the provided metadata. Further verification after download is required to confirm the exact data types and structure.

Tags: Multimodal, Machine Learning, Multimodal Data, Behavioral Biometrics, Authentication, +1 more

Mental Health Social Media Posts with Multimodal Content

Mental Health Social Media Multimodal Dataset is a collection of social media content related to mental health topics. The dataset is hosted on Kaggle, but its specific size, origin, and creation date are not provided in the available metadata. Columns and sample data are unknown, limiting detailed assessment of its structure and content.

Tags: Text, Multimodal, Mental Health, Multimodal Data, Social Media, Sentiment Analysis, Healthcare, Natural Language Processing, +1 more

Mental Health Social Media Multimodal Data from Kaggle

A multimodal dataset likely containing social media posts related to mental health topics. The dataset is hosted on Kaggle, but its specific volume, content details, and creation date are unknown. The original author, organization, and data collection methodology are not specified in the provided metadata.

Tags: Multimodal, Mental Health, Behavioral Analysis, Multimodal Data, Social Media, Healthcare, Natural Language Processing, +1 more

VARISHTA-MM50: Smartphone Sensor and Video Data for Human Activity Recognition

Smartphone sensor and video data for human activity recognition and fall detection. The dataset focuses on the Indian elderly population. The author, organization, and specific collection dates are unknown.

Tags: Image, Multimodal, Medical Imaging, Multimodal Data, Computer Vision, Elderly Health, Fall Detection, MM50, Human Activity Recognition, Indian Population, +1 more

LEMON: Large Endoscopic Monocular Video Dataset for Surgical Perception

LEMON is a large dataset of full FPS endoscopic monocular videos introduced in the paper 'LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings'. The dataset is hosted by the user 'visurg' on Hugging Face and was last updated on April 8 (year not listed). The repository provides the full video collection for download.

Tags: Video, Multimodal, Surgical Vision, Medical Perception, Computer Vision, Endoscopic Video, +1 more

NSFW-T2I: 38,000 Image-Text Pairs with AI-Generated Captions

38,000 image-text pairs sourced from LAION and nsfw_detect datasets. Captions were generated by the LLaVA-NeXT model using a prompt requesting detailed descriptions of person attributes. The dataset was created by author K00B404 and last updated on Hugging Face in April 2026.

Tags: Multimodal, Image Text Pairs, NSFW Content, Multimodal Training, Computer Vision, AI Generated Captions, Synthetic, +1 more

Chuckle-WavLM-555-Videos: Audio-Visual Data for Speech Processing

Chuckle-WavLM-555-Videos is a dataset hosted on Kaggle. The title suggests it likely contains audio and video data, potentially for speech or multimodal machine learning tasks. The dataset's specific content, size, and collection details are not provided in the available metadata.

Tags: Multimodal, Machine Learning, Audio Video, Speech Processing, +1 more

ImagenWorld: Benchmark for Image Generation and Editing Models

ImagenWorld is a large-scale benchmark designed to evaluate image generation and editing models in realistic multimodal scenarios. It spans six diverse tasks and six content domains, providing a unified framework for assessing model compositionality, instruction following, and multimodal capability. The dataset is hosted by TIGER-Lab and was last updated on April 14, 2026.

Tags: Multimodal, AI Evaluation, Benchmark, Computer Vision, Large Scale, Multimodal Benchmark, +1 more

VisualOverload: A Visual Question Answering Benchmark with 2,720 Question-Answer Pairs

The VisualOverload benchmark for visual question answering (VQA) comprises 2,720 question-answer pairs. It was created by paulgavrikov and presented at CVPR 2026, with a last update recorded on 2026-04-15. The dataset is designed to challenge models on visual understanding tasks beyond global image comprehension.

Tags: Multimodal, Multimodal AI, Benchmark, Computer Vision, Visual Question Answering, VQA Benchmark, +1 more

Phi-4-Multimodal-Instruct-Local: A Multimodal Instruction Dataset

A dataset titled 'phi4-multimodal-instruct-local' published on Kaggle. The title suggests it likely contains instruction-response pairs for multimodal AI model training. The dataset's specific content, size, and creation details are not provided in the available metadata.

Tags: Multimodal, Multimodal AI, Language Model, Instruction Tuning, +1 more

Gemma4 DocVQA Results: Model Performance on Document Visual Question Answering

Results from evaluating the Gemma4 model on a Document Visual Question Answering (DocVQA) task. The dataset was published on the Hugging Face platform by the author G2good4uG and was last updated on June 4, 2026. The specific metrics, scores, and underlying test data are not detailed in the available metadata.

Tags: Multimodal, LLM Evaluation, Gemma Model, Multimodal Benchmark, Document VQA, +1 more

RADVQA: Radiology Image Quality Assessment Dataset

RADVQA appears to be a dataset related to radiology and visual question answering, hosted on Kaggle. Its specific contents, such as the number of images or questions, are not detailed in the available metadata. The dataset's author, organization, and last update date are currently unknown.

Tags: Multimodal, Medical Imaging, Quality Assessment, Radiology, +1 more

Deeptumorvqa: Medical Images for Tumor Visual Question Answering

Deeptumorvqa Image contains original images collected from publicly available datasets. The dataset was uploaded by ZiyueWang and last updated on May 13, 2026. It is intended for tasks related to tumor analysis and visual question answering.

Tags: Image, Multimodal, Medical Imaging, Tumor Analysis, Computer Vision, Deep Learning, Visual Question Answering, +1 more

Medical Image Caption Pairs with Expert and Layman Descriptions

MedLayBench-V provides 79,789 medical image-text pairs across 7 imaging modalities. Each image is paired with both a clinical expert caption and a patient-friendly layman caption. The dataset, created by hanjang, was released in April 2026.

Tags: Multimodal, Medical Imaging, Medical Vision Language, Benchmark, Expert Lay Alignment, Healthcare, Computer Vision, Large Scale, Multimodal Benchmark, +1 more

AI Ethics Preference Annotations for 95 Prompts and 190 Response Pairs

A human-annotated preference dataset for RLHF and Direct Preference Optimization (DPO), focused on AI ethics failure modes. It contains 95 prompts and 190 response pairs, with full annotation across five dimensions. The dataset was created by AI ethics specialist Mandy Hathaway and last updated on 2026-04-13.

Tags: Text, RLHF, AI Ethics, Failure Modes, Preference Annotation, DPO, Synthetic, +1 more

Vimedpet-DirectVLM-20260712: A Vision-Language Model Dataset

The dataset title suggests a resource related to the DirectVLM model, and its date stamp (20260712) points to a release in July 2026. It is hosted on the Kaggle platform, but detailed metadata such as author, size, and specific content is not provided. The data's nature and scale must be verified after download.

Tags: Multimodal, Vision Language Model, Multimodal AI, Vimedpet, DirectVLM, +1 more

MSVD-Blip2QFormer: Video Captioning Dataset for Multimodal AI

MSVD-Blip2QFormer is a dataset likely derived from the Microsoft Research Video Description (MSVD) corpus, processed through the BLIP-2 model's Q-Former component. It is hosted on Kaggle, but specific details about its size, creation date, and author are not provided. The dataset appears designed for training and evaluating multimodal AI systems that link visual and textual information.

Tags: Multimodal, MSVD, QFormer, Multimodal AI, Video Captioning, BLIP-2, +1 more

K-MetBench: A Multi-Dimensional Benchmark for Meteorology Models

K-MetBench is a multi-dimensional benchmark for evaluating meteorology models across accuracy, reasoning quality, geo-cultural alignment, and fine-grained domain coverage. The dataset was created by soyeonbot and was last updated on Hugging Face in April 2026. Its public evaluation protocol uses an explicit advanced benchmark and an explicit reasoning benchmark followed by LLM-as-a-judge evaluation.

Tags: Multimodal, Meteorology, Benchmark, LLM Evaluation, Reasoning Benchmark, Fine Grained Evaluation, Multimodal Benchmark, +1 more
