DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,937 datasets

Ko Vdr Hn: Korean Visual Document Retrieval Hard Negatives for Embedding Models

Korean Visual Document Retrieval Hard Negatives is a multimodal training set for fine-tuning embedding models. The dataset, created by whybe-choi, was last updated on 2026-04-25. Each row contains a text query, a page image document, one positive match, and seven mined hard negatives.

Multimodal, Korean Language, Document Images, Embedding Training, Computer Vision, Multimodal Retrieval

HH-RLHF: Human Preference Data for Reinforcement Learning from Human Feedback

A dataset likely containing human preference labels for training reinforcement learning from human feedback (RLHF) models. The dataset is published on Kaggle, but specific details about its size, creation date, and authors are not provided in the available metadata. Its title suggests it is related to the 'Helpful and Harmless' (HH) benchmark for aligning language models.

Text, Tabular, Preference Data, Reinforcement Learning, Human Feedback, AI Alignment

TDTU VQA Dataset: Vietnamese Medicinal Herbs for Visual Question Answering

Vietnamese Visual Question Answering dataset focused on medicinal plants and herbs. It was developed for scientific research at Ton Duc Thang University (TDTU) to advance AI models for herb recognition and question answering. The dataset was last updated on Hugging Face in April 2026.

Multimodal, Medicinal Plants, Herbs, Vietnamese, Visual Question Answering

A11y-CUA: Multimodal Computer-Use Accessibility Dataset with Real Task Trajectories

A11y-CUA is a multimodal dataset containing real computer-use task trajectories from sighted users, blind and low vision users, and AI agents. The dataset includes structured interaction logs, metadata, screen video, and system audio for each task. It was created by berkeley-hci and was last updated on Hugging Face in April 2026.

Audio, Multimodal, Task Trajectory, Computer Vision, Accessibility, Assistive Technology, Human Computer Interaction

Multimodal Unlearning Evaluation Benchmark for VQA and CIFAR-10

Evaluation outputs for studying metric inconsistency in multimodal machine unlearning, supporting reproducibility for a NeurIPS 2026 paper. The dataset contains results on VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench) and CIFAR-10 baseline results. It was created by author 'neurips26' and last updated on 2026-05-01.

Multimodal, NeurIPS, Machine Unlearning, Benchmark, VQA Benchmarks, Multimodal Evaluation

ChartWise AutoScientist: A Multimodal Chart Interpretation Dataset

A multimodal dataset for chart interpretation and visual reasoning tasks. The dataset was sourced from Kaggle, but specific details about its size, authorship, and creation date are not provided. Its content likely contains chart images paired with textual descriptions or questions to support visual reasoning.

Multimodal, Chart Interpretation, Multimodal AI, Visual Reasoning

Strawberry Disease Multimodal Dataset with Environmental Parameters

This multimodal dataset for strawberry disease detection contains strawberry images, the corresponding environmental parameters (air temperature, air humidity, soil moisture), and strawberry variety information. It can be used to study the correlation between environmental factors and strawberry disease occurrence, and to develop multimodal fusion algorithms for disease detection. The dataset was authored by Qin2006 and last updated on 2026-04-19.

Multimodal, Healthcare, Computer Vision, Agriculture, Environmental Data, Plant Disease

Mm Straw5: Strawberry Disease Images with Environmental Parameters

The Strawberry Disease Multimodal Dataset by Qin2006 contains strawberry image data paired with environmental parameters and variety information. It is designed for studying correlations between environmental factors and disease occurrence, as well as multimodal fusion detection algorithms. The dataset was last updated on Hugging Face on 2026-04-19.

Multimodal, Plant Health, Healthcare, Computer Vision, Agriculture, Environmental Data, Strawberry Disease

MedMNIST+ 2D Dataset Overview for BioFuse Framework

An overview of the MedMNIST+ 2D benchmark datasets used to evaluate the BioFuse embedding fusion framework. The dataset is 9.5 KB in size, authored by Mirza Nasir Hossain, and was last updated on March 18, 2026. It was used to test a framework that fuses embeddings from 9 state-of-the-art foundation models to achieve high performance on biomedical image classification tasks.

Tabular, Source Framework Designed, MedMNIST Benchmark, Achieving SOTA AUC, Biomedical Imaging, Foundation Models, Remaining Datasets, Model Compatibility, Using XGBoost, Benchmark, Vector Concatenation, Optimal Combination, Specific Subdomains, Embedding Fusion Framework, Fusion Techniques, Poses Challenges, Feature Fusion, Models Promises, Modal Relationships, Novel Open, Employs Grid Search, Incorporate Future Models, Uncovering Cross, Embedding Fusion, SOTA Performance Across, Xlink

En Vdr Hn: English Visual Document Retrieval Hard Negatives for Training

En Vdr Hn is a multimodal retrieval training set for fine-tuning visual-document embedding models on English document pages. The dataset, created by whybe-choi and last updated on 2026-04-26, provides query text and page image pairs, with each row containing one positive and seven mined hard negatives. Hard negatives were mined using the Qwen/Qwen3-VL-Embedding-8B model within each source dataset.

Multimodal, Document Images, Computer Vision, Multimodal Retrieval, English Language

Multimodal Translation Quality Dataset with Text, Image, and Audio Features

Kaggle hosts a dataset for evaluating English translation quality. It contains multimodal features, including text, image, and audio data. The author, organization, and specific data volume are not provided in the available metadata.

Audio, Multimodal, Machine Translation, Benchmark, Computer Vision, English Language, Quality Evaluation, Multimodal Translation

DocVQA Media Labeled Clean: A Dataset for Document Visual Question Answering

DocVQA Media Labeled Clean is a dataset hosted on Hugging Face by author merve. The dataset was last updated on June 5, 2026. Its specific content and scale are unknown from the provided metadata.

Multimodal, Labeled Data, Multimodal AI, Visual Question Answering, Document VQA

MERRIN: A Human-Annotated Benchmark for Multimodal Reasoning in Noisy Web Environments

MERRIN is a human-annotated benchmark for evaluating search-augmented agents on multi-hop reasoning over noisy, multimodal web sources. It measures agents' ability to identify relevant modalities, retrieve evidence from the open web, and reason over conflicting sources spanning text, images, video, and audio. The dataset was created by HanNight and was last updated on 2026-04-16.

Audio, Multimodal, Evidence Retrieval, Web Data, AI Benchmark, Benchmark, Human Annotated, Multimodal Reasoning

Sambit-Multimodal-TabM-v1: A Multimodal Dataset for Machine Learning

A dataset titled 'sambit-multimodal-tabm-v1' is hosted on Kaggle. The title suggests it contains multimodal data, likely combining tabular information with other data types. No further details on size, origin, or specific content are available from the provided metadata.

Tabular, Multimodal, Machine Learning, AI Training

Case Report and Literature Review on Rare Apical Hypertrophic Cardiomyopathy

A figshare document by Yuting Yi, last updated in March 2026, details a single clinical case of a 54-year-old woman with a rare cardiac condition, apical hypertrophic cardiomyopathy (ApHCM). The 19.6 KB file includes a case report and a systematic literature review identifying 13 reported cases of ApHCM with calcification. The document integrates clinical presentation, multimodality imaging, histopathology, and genetic analysis findings.

Text, Medical Case Report, Genetic Testing, Multimodality Imaging, Myocardial Calcification, PET CT 18 Fluorodeoxyglucose, Healthcare, Apical Hypertrophic Cardiomyopathy, Cardiomyopathy, Endomyocardial Fibrosis

Indian Hybrid Papaya Multimodal Dataset for Smart Agriculture

The Indian Hybrid Papaya Multimodal Dataset is a collection of data for smart agriculture research. Its specific size, author, and last update date are unknown. The dataset appears to be designed for multimodal analysis of papaya crops.

Multimodal, Smart Farming, Computer Vision, Agriculture, Papaya

TAB-VLM: Temporal Anachronism Benchmark for Vision-Language Models

TAB-VLM is a benchmark dataset containing 600 examples designed to measure cultural anachronism in Vision-Language Models. It was created by authors Mukul Ranjan, Prince Jha, Khushboo Kumari, and Zhiqiang Shen, with a paper accepted for ACL 2026 Findings. The dataset assesses the tendency of models to misinterpret historical artifacts using temporally inappropriate concepts.

Time Series, Multimodal, Vision Language Models, AI Benchmark, Benchmark, Computer Vision, Cultural Anachronism, Temporal Reasoning

InternData-A1: 630,000 Robotic Trajectories Across 4 Embodiments

InternData-A1 contains over 630,000 trajectories and 7,433 hours of robotic manipulation data across 4 embodiments and 227 scenes. Created by InternRobotics and documented in arXiv 2511.16651, it provides a hybrid synthetic-real collection covering 18 skills and 70 tasks.

Modality: 3d, Robotic manipulation, Language: en, Modality: text, Task Categories: robotics, Modality: image, Task Categories: other, Size Categories: n1 T, Region: us, arXiv: 2511.16651

ASVspoof-WavLM-Clean-Model: Audio Deepfake Detection Model

A machine learning model likely related to the ASVspoof challenge for detecting spoofed or deepfake audio. It appears to be based on the WavLM architecture and is described as a 'clean' model, suggesting a focus on robustness or specific training conditions. The entry is hosted on Kaggle, but detailed metadata about its contents and creation is unavailable.

Audio, Multimodal, Machine Learning Models, Speech Processing, Audio Spoofing Detection

ASVspoof WavLM Disent Model: Audio Spoofing Detection Model

ASVspoof WavLM Disent Model is a machine learning model for detecting spoofed audio, likely related to the ASVspoof challenge series. It is published on Kaggle, a platform for data science and machine learning. The model's architecture appears to involve WavLM, a self-supervised speech representation model, and disentanglement techniques.

Audio, Multimodal, Machine Learning Models, Speech Processing, Audio Spoofing Detection
