DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Federated Multimodal Deep Learning for Real-Time Sepsis Risk Prediction

A dataset for developing real-time sepsis risk prediction models using a federated learning approach. The data likely contains multimodal clinical information from intensive care units, though specific sources and scale are not detailed. It was sourced from Kaggle under the 'Research' tag, but the author, organization, and last update date are unknown.

Tags: Multimodal, Clinical Prediction, Multimodal Data, Research, Sepsis Risk, Healthcare AI, Federated Learning, +1

ZeroBench: High-Difficulty Visual Reasoning Tasks for LMM Evaluation

ZeroBench is a visual reasoning benchmark containing fewer than 1,000 image-text pairs designed to challenge contemporary Large Multimodal Models (LMMs). Created by Jonathan Roberts and associated with arXiv paper 2502.09696, the dataset was updated in December 2025 to include refined hierarchical question structures. It focuses on tasks that were considered nearly unsolvable for multimodal models at the time of its release.

Tags: Parquet, Task Categories: image-text-to-text, Library: polars, Size Categories: n<1K, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Library: pandas, Region: us, arXiv: 2502.09696, +1

LLaVA-CoT-100k: Vision-Language Reasoning Dataset

Introduced in the paper 'LLaVA-CoT: Let Vision Language Models Reason Step-by-Step', this dataset is designed to enable Vision-Language Models to perform autonomous multistage reasoning. It integrates 100,000 samples from various visual question-answering sources with structured reasoning annotations. The dataset was authored by Xkev and last updated on the Hugging Face platform in December 2025.

Tags: Multimodal, Vision Language Models, Multimodal AI, Computer Vision, Reasoning, Visual Question Answering, +1

Multimodal Wearable-Based Panic Episode Detection

A dataset for multimodal, wearable-based detection of panic episodes that combines EEG with other sensor data. It likely contains physiological signals collected from wearable devices. It is hosted on Kaggle and tagged for research purposes.

Tags: Multimodal, Wearable Sensors, Multimodal Data, Panic Detection, EEG, Research, +1

Hybrid Multimodal Fault Diagnosis for HVAC Systems

A hybrid multimodal dataset for diagnosing faults in Heating, Ventilation, and Air Conditioning (HVAC) systems. The dataset is associated with a research paper proposing a Bayesian Tensor-Network approach. It was sourced from Kaggle and is categorized under the 'Research' tag.

Tags: Multimodal, HVAC Systems, Fault Diagnosis, Multimodal Data, Research, Tensor Networks, Bayesian Models, +1

Wafermap VQA With Rubrics 2602 V2: Visual Question Answering for Semiconductor Defects

A dataset titled 'Wafermap VQA With Rubrics 2602 V2', published on Hugging Face by author Niraya666 and last updated on 2026-02-09. The title suggests it contains wafermap images and associated rubrics for visual question answering tasks, likely related to semiconductor manufacturing quality control.

Tags: Multimodal, Wafer Defect, Quality Control, Semiconductor Manufacturing, Visual Question Answering, +1

Multimodal Damage Identification for Humanitarian Computing

Multimodal Damage Identification for Humanitarian Computing is a dataset from the UCI Machine Learning Repository. It is designed for assessing damage in disaster scenarios, likely combining multiple data types such as images and text. The dataset's creator and specific size are not detailed in the provided metadata.

Tags: Multimodal, Multimodal Data, Humanitarian Aid, Damage Assessment, Disaster Response, +1

Lung Cancer Multimodal Imaging and Clinical Cohort for Treatment Response

A curated multimodal imaging and clinical cohort of lung cancer patients undergoing chemotherapy or immunotherapy, hosted by FangDai. The dataset is designed for research in CT segmentation, PET/CT modeling, radiomics, and survival analysis. It was last updated on December 29, 2025.

Tags: Multimodal, Medical Imaging, Lung Cancer, Healthcare, Computer Vision, Survival Analysis, Clinical Data, +1

Visual Question Answering Subset with 30,000 Images

A subset of 30,000 images from the GQA dataset, intended for training Visual Question Answering models. It is tagged for scene understanding and computer vision tasks and includes associated English text.

Tags: Image, Text, English, Computer Vision, Scene Understanding, Visual Question Answering, +1

Astronomy Images with JSON-Linked Captions for Multimodal Models

A collection of astronomy images paired with text captions stored in JSON format, intended for fine-tuning Vision-Language Models (VLMs). It is tagged for applications in image captioning, computer vision, and multimodal AI. The number of rows and columns and the file size are unknown.

Tags: Multimodal, Intermediate, Astronomy, Computer Vision, Data Cleaning, +1

EarthDial: 10K-100K Evaluation Records for Remote Sensing VLMs

EarthDial-Dataset is a curated collection of 10,000 to 100,000 evaluation-only records for remote sensing and Earth observation, released by akshaydudhane and last updated in December 2024. It benchmarks vision-language models (VLMs) on real-world satellite and aerial imagery across tasks including classification, object detection, and change detection.

Tags: Arrow, Size Categories: 10K<n<100K, Task Categories: question-answering, Language: en, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Region: us, VLM, License: apache-2.0, Remote Sensing, +1

C3: 90K Paired Floor Plans and Photos with Pixel-Level Correspondences

C3 is a cross-view cross-modality correspondence dataset containing 90,000 paired floor plans and photographs. It covers 597 scenes with 153 million pixel-level correspondences and 85,000 camera poses. The dataset was created by kwhuang and last updated on the platform in January 2026.

Tags: Image, Multimodal, CSV, Size Categories: 10K<n<100K, Library: polars, arXiv: 2511.18559, Floor Plans, Modality: text, Library: mlcroissant, Library: datasets, Library: pandas, License: cc-by-4.0, Computer Vision, Cross View, Cross Modality, Region: us, Photography, +1

T2AV-Compass: A Unified Benchmark for Text-to-Audio-Video Generation

T2AV-Compass is a benchmark dataset created by NJU-LINK for evaluating Text-to-Audio-Video (T2AV) generation models. It targets unimodal quality, cross-modal alignment, complex instruction following, and perceptual realism. The dataset was last updated on December 25, 2025.

Tags: Audio, Multimodal, Generative AI, Text To Audio Video, Benchmark, Evaluation Dataset, Multimodal Benchmark, +1

Turkish Image Visual Question Answering Pairs

Image VQA Turkish is a dataset of Turkish visual question-answer pairs for multimodal AI tasks. It was created by the author 'ituperceptron' and was last updated on Hugging Face on January 14, 2026. The dataset structure includes images, unique image IDs, and VQA pairs in JSON format.

Tags: Multimodal, Image VQA, Turkish Language, Multimodal QA, Computer Vision, Natural Language Processing, +1

Building Defect VQA Dataset for Vision-Language Model Training

A Visual Question Answering dataset derived from the BD3 Building Defect Dataset. It pairs images of building surfaces with questions and defect category answers, designed for training and evaluating Vision-Language Models. The dataset was created by author 'chandrabhuma' and was last updated on December 27, 2025.

Tags: Multimodal, Vision Language Models, Computer Vision, Building Defects, Construction, Visual Question Answering, +1

MDSM: MLLM-Driven Synthetic Multimodal Dataset

MLLM-Driven Synthetic Multimodal dataset (MDSM) is referenced in a research context titled "The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulate". The dataset likely contains synthetic multimodal data, potentially combining text and images. Its specific size, structure, and creation details are unknown.

Tags: Multimodal, English, Text, Image Text Detection, News, Text Detection, MLLM, Synthetic Data, Synthetic, +1

FGVQA: Fine-Grained Visual Question Answering Benchmark Suite

A benchmark suite introduced in the paper 'Same or Not? Enhancing Visual Perception in Vision-Language Models'. It contains 12,000 challenging (image, question, answer) tuples emphasizing fine-grained image understanding. The dataset is composed of six sub-benchmarks and is hosted by glab-caltech.

Tags: Multimodal, Vision Language Models, Benchmark, Computer Vision, Fine Grained Vision, Benchmark Suite, Visual Question Answering, +1

OCT Scans and Annotations for Hydrogel-Treated Wound Healing in Mice

A dataset of Optical Coherence Tomography (OCT) scans and human expert annotations for hydrogel-treated wounds in a mouse model. It includes raw OCT scans and corresponding tissue annotations.

Tags: OCT, Wound Healing, Medicine Health And Life Sciences, +1

Data Management Plan for Trustworthy Medical Foundation Models

A Data Management and Sharing Plan for the POSE: Phase I research project, authored by Huaxiu Yao. It describes the scientific data to be generated and/or used in the research and outlines a strategy for managing and sharing project data. The specific data types, volume, and structure are not detailed.

AV-SpeakerBench: Audiovisual QA Benchmark with 1K-10K Speaker-Aware Clips

AV-SpeakerBench is an audiovisual question-answering benchmark containing between 1,000 and 10,000 records, released in December 2024 by researcher plnguyen2908. It features trimmed segments across audio-only, visual-only, and audiovisual modalities paired with speaker-aware annotations to test fine-grained reasoning in multimodal models.

Tags: Multimodal, CSV, Size Categories: 1K<n<10K, Library: polars, arXiv: 2512.02231, Task Categories: question-answering, Modality: audio, Language: en, Task Categories: visual-question-answering, Modality: text, Library: mlcroissant, Library: datasets, Library: pandas, Question Answering, Audiovisual, Modality: video, Region: us, License: mit, +1
