DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,936 datasets

PhysRL: Physics Reasoning Datasets for Multimodal AI Training

Kun-Xiang created the PhysRL collection to accompany the SeePhys Pro research paper. It includes the full PhysRL-38K corpus and PhysRL-8K, a vision-necessary subset, both used for studying multimodal reasoning in physics. It was last updated on Hugging Face on 2026-05-13.

Multimodal, Vision Language, Multimodal AI, Scientific Benchmark, Computer Vision, Reinforcement Learning, Physics Reasoning

DermoInstruct: Large-Scale Dermatology Visual Instruction Tuning Dataset

DermoInstruct is a large-scale dermatology visual instruction tuning dataset for multimodal medical AI research. It contains image-grounded instruction-response conversations designed to support tasks such as lesion description, morphology recognition, and diagnostic reasoning. The dataset was created by mendicant04 and was last updated on 2026-05-14.

Multimodal, Visual QA, Dermatology, Healthcare, Computer Vision, Large Scale, Medical AI, Multimodal Instruction Tuning

DIM-T2I: Text-to-Image and Image Editing Dataset for AI Models

The DIM-Edit dataset accompanies the DIM-4.6B-T2I and DIM-4.6B-Edit models released in October 2025. It supports research on rebalancing designer and painter roles in unified multimodal models for image editing. It was created by Ziyun Zeng, David Junhao Zhang, Wei Li, and Mike Zheng Shou, and the associated paper was accepted to ICLR 2026.

Multimodal, Machine Learning, Multimodal AI, Computer Vision

Wiki-CoE: Multimodal Question Answering with Wikipedia Screenshots

Wiki-CoE is a multimodal question-answering dataset for evaluating visual reasoning and evidence localization. Each example pairs a natural-language question with one or more Wikipedia page screenshots, asking models to return both an answer and an explicit chain of supporting evidence. The dataset was created by PeiyangLiu and was last updated on the Hugging Face platform in May 2026.

Multimodal, Multimodal QA, Evidence Localization, Visual Reasoning

ChartInt: A Multimodal Chart Dataset for Reconstruction and Editing Tasks

ChartInt is a multimodal chart dataset designed for tasks such as chart reconstruction, editing, style transfer, interaction editing, and data updates. The dataset, created by xilinghuiye, contains 2,905 rows in its train split and was last updated on May 3, 2026. It is packaged as a datasets-compatible Parquet file for direct viewing on Hugging Face.

Multimodal, Style Transfer, Chart Editing, Chart Reconstruction, Multimodal Charts

SGMRI-VQA: 41,307 Expert-Annotated MRI Visual Question Answering Pairs

SGMRI-VQA is a 41,307-pair benchmark for spatially grounded reasoning on multi-frame MRI scans. It was built by SpatialGroundingVQA from expert radiologist annotations in the fastMRI+ dataset, covering brain and knee studies. Each question-answer pair includes a clinician-aligned chain-of-thought reasoning trace and frame-indexed bounding-box coordinates.

Multimodal, Medical Imaging, Multimodal AI, Benchmark, MRI, Radiology, Visual Question Answering

Survey and Visual Indicators for Tourism Behavior in Macao

Youcheng Wang's 2026 study integrates survey data from 519 non-local visitors in Macao with street-level visual indicators. The multimodal analysis examines relationships among destination image, perceived value, perceived risk, satisfaction, attitude, and responsible tourism behavioral intention.

Multimodal Analysis, High Density Urban Destination, Urban Visual Density, Macao, Street Level Visual Indicators, Responsible Tourism Behavior

Lost On Campus: Vision-Language Model Benchmark for Outdoor Navigation

The Lost On Campus benchmark evaluates the Embodied Scene Representation (ESR) of Vision-Language Models in large-scale, real-world outdoor 3D environments reconstructed with 3D Gaussian Splatting. It introduces a unified reasoning-action evaluation framework that integrates diagnostic QA and closed-loop interactive navigation under multimodal instructions. The dataset is authored by lost-on-campus-project and was last updated on 2026-05-07.

Geospatial, Multimodal, Vision Language Model, University, Benchmark, Navigation, Computer Vision, Campus Navigation, Large Scale, 3D Environment, Indoor Mapping

InfraFlood-NC: Infrastructure-Specific Flood Annotations for North Carolina

This dataset covers six urban areas in North Carolina impacted by Hurricanes Matthew and Florence. It provides binary flood extent annotations paired with building footprints and road networks, derived from high-resolution (1.5 cm to 25 cm) imagery. The data is structured into 10 spatial divisions and formatted for both deep learning model training and traditional GIS analysis.

Geospatial, Multimodal, Text, XML, Visual Question Answering (VQA), Hurricane Impact, Infrastructure Flood Extent, GeoJSON, Geospatial Annotations, North Carolina, Infrastructure, Flood Extent, Shapefile Data, Visual Question Answering

Structural and Functional Brain Connectivity in ALS Patients

A multimodal connectomic analysis of Amyotrophic Lateral Sclerosis integrates cortical thickness-based structural covariance networks, diffusion MRI tractography, and resting-state and task-based functional MRI. The study employs a 104-node parcellation scheme based on the Desikan-Killiany atlas to examine structure-function coupling in ALS patients and matched controls. It reports preserved global network topology but selective reorganization within motor and interhemispheric pathways.

Brain Network, DTI, Connectivity, ALS

Structural and Functional Brain Connectivity in ALS Patients

A multimodal connectomic analysis of Amyotrophic Lateral Sclerosis integrates cortical thickness-based structural covariance networks, diffusion MRI tractography, and resting-state and task-based functional MRI. The study employs a 104-node parcellation scheme based on the Desikan-Killiany atlas to examine structure-function coupling and network reorganization in ALS patients versus matched controls.

Brain Network, DTI, Connectivity, ALS

Multimodal Brain Connectivity in Amyotrophic Lateral Sclerosis

This study examines structural and functional brain connectivity in Amyotrophic Lateral Sclerosis (ALS) patients and matched controls. The analysis employed a 104-node brain parcellation scheme, integrating cortical thickness, diffusion MRI tractography, and resting-state and task-based functional MRI. Graph-theoretical metrics were derived to examine cross-modal structure–function correspondence.

Brain Network, DTI, Connectivity, ALS

CoALS II: Structural and Functional Brain Connectivity in ALS Patients

A multimodal connectomic analysis integrating cortical thickness, diffusion MRI, and resting-state and task-based functional MRI from ALS patients and matched controls. The study employed a 104-node brain parcellation scheme and graph-theoretical metrics to analyze structure–function coupling. The dataset, authored by Vijay Renga and last updated in March 2026, is shared under a CC-BY-4.0 license.

Image, Graph, Multimodal, Amyotrophic Lateral Sclerosis, Healthcare, MRI, Connectomics, Brain Imaging, Neurodegeneration

MPCI-Bench: Multimodal Pairwise Contextual Integrity Evaluation for Language-Model Agents

MPCI-Bench is a benchmark for evaluating the contextual integrity of multimodal language-model agents. Each benchmark pair starts from a VISPR image and contains two contrastive information flows: one appropriate case and one inappropriate case, each represented at three levels of increasing context. The dataset was created by Soojuu and was last updated on Hugging Face in May 2026.

Multimodal, AI Evaluation, Privacy Utility, Benchmark, Contextual Integrity, Computer Vision, Multimodal Benchmark, Language Model Agents

MedHorizon: 340 Full-Procedure Clinical Videos for Sparse Evidence Retrieval

MedHorizon is a long-context medical video benchmark created by DBD123 and last updated on 2026-05-07. It contains 340 full-procedure clinical videos paired with 1,253 multiple-choice question-answer pairs. The benchmark is designed to evaluate multimodal models on tasks requiring sparse evidence retrieval and multi-hop reasoning across long videos.

Video, Multimodal, Benchmark, Healthcare, Medical Video, Video QA, Multimodal Benchmark, Clinical Procedures

Multimodal Driver Response Data Across Diverse Road Scenarios

EmoRoad provides anonymized clip and raw data capturing psychological, physiological, and behavioral human-subject responses in varied driving conditions. The 3.3 GB dataset was created by RCFCM Hong Kong and released as open access in April 2026. It integrates multiple sensor modalities to study driver states.

Time Series, Multimodal, ZIP, Driving Behavior, Affective Computing, Multimodal Sensing, Human Factors

MedHorizon: 340 Full-Procedure Clinical Videos for Long-Context Evaluation

MedHorizon provides 340 full-procedure clinical videos paired with 1,253 multiple-choice questions for evaluating multimodal AI models. The benchmark emphasizes two challenging properties: extremely sparse evidence retrieval and multi-hop reasoning across observations distributed throughout lengthy procedures. It was created by mlvbench-review and last updated on Hugging Face in May 2026.

Video, Multimodal, QA Evaluation, Benchmark, Healthcare, Medical Video, Long Context, Multimodal Benchmark, Clinical Procedures

Human Action And Intent

A structured dataset of real-world VR forklift operation tasks, capturing aligned state, action, and outcome trajectories. It contains 384,950 timesteps at 50 Hz across 9 training episodes, created by fl-simulators and last updated on 2026-04-21. The data includes explicit intent, task structure, and reward signals for success, failure, and safety events.

Tabular, Time Series, Telemetry, VR Simulation, xAPI, Intent, Human Action
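
As a rough sanity check on the figures in this entry, the recording time implied by 384,950 timesteps at 50 Hz can be worked out directly. The sketch below assumes the sampling rate is constant and that the timestep count spans all 9 episodes. The variable names are illustrative and are not part of the dataset.

```python
# Figures quoted in the listing entry above.
TIMESTEPS = 384_950   # total timesteps
RATE_HZ = 50          # samples per second
EPISODES = 9          # training episodes

# Assumption: constant 50 Hz sampling across every episode.
total_seconds = TIMESTEPS / RATE_HZ
total_minutes = total_seconds / 60
mean_episode_minutes = total_minutes / EPISODES

print(f"total: {total_seconds:.0f} s ({total_minutes:.1f} min)")
print(f"mean per episode: {mean_episode_minutes:.1f} min")
```

Under those assumptions the data amounts to a little over two hours of operation, or roughly 14 minutes per episode on average. Individual episode lengths are not stated in the listing and may differ.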

CRDI Corporate Fraud Multimodal Dataset with 276 Cases

This dataset contains 276 corporate fraud cases that integrate geospatial, vocal-stress, and linguistic features. It is hosted on Kaggle, but the author, organization, and creation date are unknown. The provided metadata does not describe the collection methodology or temporal coverage.

Geospatial, Multimodal, Linguistic Analysis, Corporate Fraud, Multimodal Features, Vocal Stress

Comparison of Representative Multimodal Fusion Methods

Chong Liu authored a comparative analysis of multimodal fusion methods, published on figshare. The dataset is a 9.5 KB Excel file last updated on April 24, 2026.

Tabular, Excel, Machine Learning, Comparative Analysis, Multimodal Fusion
