DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,939 datasets

PerturbReason: Training Data for a Multimodal Virtual Cell Genetic Perturbation Model

PerturbReason is the training dataset for the AROMA model, a multimodal architecture for virtual cell modeling presented at ACL 2026. The dataset integrates textual evidence, graph topology, and protein sequences to predict the effects of genetic perturbations. It was authored by blazerye and last updated on Hugging Face in April 2026.

Tags: Multimodal, Genetic Perturbation, Virtual Cell Modeling, Multimodal AI, Bioinformatics, +1 more

Text-to-Video Model Rankings with 91,000 Human Preference Labels

Datapoint AI collected ~91,000 human ranking labels for text-to-video generation models. The dataset contains rankings for 5 videos per prompt across 3 quality dimensions, as judged by 15 annotators per dimension. It was last updated on Hugging Face in April 2026.

Tags: Multimodal, Text-to-Video, AI Evaluation, Human Preferences, Model Ranking, Synthetic, +1 more

GitHub Documentation Dataset: 2,900+ Structured Files Across 15 Languages

This collection contains over 2,900 structured GitHub documentation files across 15 languages. It is optimized for training large language models and retrieval-augmented generation systems. The author, organization, and last update date are unknown.

Tags: Text, Multilingual, GitHub, Software Documentation, LLM Training, Software Development, Documentation, Text Corpus, +1 more

LLaVA-Med: Medical Visual Question Answering Dataset

A multimodal dataset for medical visual question answering, published on Kaggle. The dataset likely contains pairs of medical images and associated textual questions and answers. Specific details on size, source, and creation date are not provided in the available metadata.

Tags: Multimodal, Medical Imaging, Multimodal AI, Medical Vision Language, Clinical QA, +1 more

Spectra: Multimodal VQA Training Data for Science and Open-World Knowledge

Spectra is a multimodal question-answering training dataset designed for vision-language models. It combines graduate-level science questions from TQA and ScienceQA with open-world knowledge questions from OKVQA and science questions across physics, chemistry, math, and biology from AI2D. The dataset was created by Tamalmajumder and was last updated on April 18, 2026.

Tags: Multimodal, Training Data, Open World Knowledge, Multimodal VQA, Science Questions, +1 more

Multimodal Video Annotation Samples with Abstracted Visual Assets

SuperviseLab provides professional video annotation data for training multimodal AI models. This public sample dataset demonstrates annotation methodology and output quality across diverse video content categories. All visual assets have been abstracted to protect source privacy, and identifiable metadata has been removed.

Tags: Video, Multimodal, Video Annotation, Training Data, Multimodal AI, Computer Vision, +1 more

CVQAD: 1,962 FullHD Videos for Compression Artifact Evaluation

1,962 FullHD videos with YUV420 encoding and durations of 10-15 seconds form the open part of the MSU compression artifacts dataset. The dataset, developed by deepfakesMSU, includes videos at frame rates of 24, 25, 30, 39, 50, and 60 fps for evaluating video quality metrics. The full description is available on the dataset page, and the dataset was last updated on April 14, 2026.

Tags: Video, No-Reference Metrics, Benchmark, Computer Vision, Full-Reference Metrics, Video Quality, Compression Artifacts, +1 more

LLaVA-Med-v1.5-Mistral-7B: A Vision-Language Model for Medical AI

LLaVA-Med-v1.5-Mistral-7B is a dataset likely containing a model or associated data for a large vision-language model specialized in medical applications. The dataset is hosted on Kaggle, but its specific contents, scale, and creation details are not provided in the available metadata. Columns, sample data, and authorship information are unknown.

Tags: Multimodal, Vision Language Model, Multimodal AI, Large Language Model, Medical AI, +1 more

My-WavLM-Model: A Speech Representation Model

A WavLM model for audio processing, published on Kaggle. The dataset likely contains model weights or related artifacts for speech representation learning. Specific details on the model's architecture, training data, and performance are not provided in the available metadata.

Tags: Audio, Machine Learning, Audio Model, Speech Processing, +1 more

DanQing100M: 100 Million Chinese Image-Text Pairs for Vision-Language Pre-training

DanQing100M is a large-scale Chinese vision-language dataset containing 100 million image-text pairs, totaling 12 terabytes. It was created by researchers including Hengyu Shen, Tiancheng Gu, and others from DeepGlint-AI, using web data from 2024 to 2025. The dataset is intended for vision-language pre-training tasks.

Tags: Multimodal, Chinese, Image-Text Pairs, Web Data, Pre-Training, Vision Language, Computer Vision, Large Scale, +1 more

ACC2026 Track2 Augmented VQA: Visual Question Answering Dataset

A dataset for the ACC2026 Track2 competition, likely focusing on augmented visual question answering. Published on Kaggle, its specific content, size, and creation details require verification after download. The dataset appears to be designed for tasks involving both visual and textual data.

Tags: Multimodal, Multimodal AI, Computer Vision, Augmented Reality, Visual Question Answering, +1 more

GTPBD-MM: Multimodal Remote Sensing Data for Terraced Parcel Extraction

GTPBD-MM is the first multimodal benchmark for terraced scenes, integrating optical imagery, textual descriptions, and Digital Elevation Model (DEM) data. The dataset provides three levels of annotations: parcel, mask, and boundary. It was created by author wxqzzw and last updated on April 15, 2026.

Tags: Geospatial, Multimodal, Land Parcels, Benchmark, Terraced Agriculture, +1 more

Multimodal Face Generation Data with Spatial and Semantic Conditioning

MMFace-DiT Dataset provides multimodal conditioning data for high-fidelity, controllable face synthesis. The dataset, created by BharathK333, includes spatial elements like masks and sketches paired with VLM-enriched semantic captions. It was accepted to CVPR 2026 and last updated in April 2026.

Tags: Multimodal, Computer Vision, Face Generation, Synthetic Data, Multimodal Benchmark, +1 more

Synthetic Multimodal Data for Anaemia Screening

Kaggle hosts a synthetic dataset for anaemia screening. The data is multimodal, likely containing a combination of data types such as images, text, or tabular records. Its synthetic nature suggests it was generated for research and development purposes, though specific details on size, origin, and creation date are unavailable.

Tags: Multimodal, Multimodal Medical, Anaemia Screening, Synthetic Data, Synthetic, Medical Diagnosis, +1 more

ACC2026 Track2 Qwen VQA Training Dataset v1

A dataset likely designed for the ACC2026 Track2 competition, focusing on Visual Question Answering (VQA). It is associated with the Qwen model and is published on Kaggle. The specific content, size, and collection details are not provided in the available metadata.

Tags: Multimodal, Training Data, Qwen, Vision Language, Multimodal AI, VQA, +1 more

Bangla News Headlines with Paired Images for Multimodal Classification

20,000 Bangla news headlines are paired with corresponding images for multimodal classification tasks. The dataset is hosted on Kaggle, but details about its author, organization, and creation date are unknown. Column-level documentation and file formats are also unspecified.

Tags: Multimodal, Machine Learning, Multimodal Classification, News Articles, Computer Vision, Bangla Language, Natural Language Processing, +1 more

FashionMV: Multi-View Fashion Images for Composed Image Retrieval

FashionMV is a large-scale dataset for product-level Composed Image Retrieval (CIR) created by yuandaxia. It contains 127,000 products, 472,000 multi-view images, and over 220,000 CIR triplets, built through an automated pipeline leveraging large multimodal models. The dataset was last updated on April 14, 2026.

Tags: Multimodal, Fashion, Computer Vision, Large Scale, +1 more

Hateful Memes Fine-Grained Dataset with 2,030 Multimodal Examples

2,030 memes form a fine-grained extension of the Hateful Memes dataset, annotated for nuanced analysis of harmful content. The dataset was created by nils-herrmann and last updated on 2026-04-08. It introduces annotation dimensions for incivility and intolerance beyond binary hatefulness.

Tags: Multimodal, Hate Speech, Content Moderation, Meme Analysis, +1 more

Deepfire X Stanford Wildfire Spread Dataset: Per-Day Rasters for U.S. Fires

This dataset contains multimodal per-fire, per-day raster data covering U.S. wildfire spread from 2016 to 2025. It is hosted on Kaggle and appears to be a collaboration between Deepfire and Stanford. The rasters provide daily snapshots of fire progression.

Tags: Geospatial, Multimodal, Environmental Science, Multimodal Data, Geospatial Rasters, Wildfire Spread, +1 more

VizWiz-VQA-Grounding: Visual Question Answering with Grounding Annotations

VizWiz-VQA-Grounding is a dataset likely designed for visual question answering tasks. It appears to be hosted on Kaggle, but detailed metadata about its size, structure, and creation is unavailable. The title suggests it contains images paired with questions and answers, potentially with grounding annotations linking answers to specific image regions.

Tags: Multimodal, Multimodal AI, Image Grounding, Computer Vision, Accessibility, Visual Question Answering, +1 more
