DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Surveillance Video Data for Anomaly and Crime Detection

Surveillance video data supports anomaly and crime detection tasks. The dataset is tagged for applications in security monitoring and video analysis. Specific details on volume, features, and origin are unavailable.

Tags: Video, Surveillance, Anomaly Detection, Security Monitoring, Crime Detection (+1 more)

VQA-Validation-Data: Visual Question Answering Benchmark

A validation dataset for the Visual Question Answering (VQA) task, published on Kaggle. The dataset likely contains image-question-answer pairs designed to test models' ability to answer questions about visual content. The specific number of samples, data source, and creation date are not provided in the available metadata.

Tags: Multimodal, Computer Vision, Natural Language Processing, Visual Question Answering (+1 more)

VIGOR Annotations for LLaVA: Vision-Language Grounding Data

VIGOR_annotations_llava is a dataset published on Kaggle. Its title suggests it contains annotations for the LLaVA (Large Language and Vision Assistant) model framework, likely linking images with descriptive text. The specific content, scale, and origin require verification after download.

Tags: Multimodal, Vision Language, LLaVA, Multimodal Annotations, Image Captioning (+1 more)

Vigor Annotations LLaVA: Vision-Language Annotations for AI Training

vigor_annotations_llava is a dataset hosted on Kaggle. The title suggests it contains annotations likely intended for training or evaluating vision-language models, such as those based on the LLaVA architecture. Specific details regarding the data volume, creation method, and update history are not provided in the available metadata.

Tags: Multimodal, Annotations, Vision Language, LLaVA, Multimodal AI (+1 more)

Cognitive Load Assessment via Multimodal Inputs

Multimodal data, likely collected to assess cognitive load, a psychological state related to mental effort. The dataset is published on Kaggle, but its size, collection method, and authorship are unknown. Its content and structure require verification after download.

Tags: Multimodal, Psychology, Multimodal Data, Cognitive Load, Human Computer Interaction (+1 more)

Twitter Multimodal Rumor Dataset

A dataset from Kaggle focused on rumors circulating on the social media platform Twitter. The dataset likely contains multimodal content, such as text and associated images or videos, for analysis. Metadata is minimal; actual content, size, and collection details require verification after download.

Tags: Multimodal, Twitter, Social Media, Rumor Detection (+1 more)

Nemotron Instruction Following Chat V1: Multi-Turn Dialogue Data for LLM Training

Nemotron-Instruction-Following-Chat-v1 is designed to strengthen model capabilities in open-ended chat, precise instruction following, and structured output generation. It combines refreshed multi-turn chat data with synthetic dialogues generated by frontier models such as GPT-OSS-120B and Qwen3-235B variants. The dataset was created by NVIDIA and last updated on December 15, 2025.

Tags: Text, JSON, library: polars, language: en, Chat Dialogue, modality: text, size category: 100K<n<1M, library: mlcroissant, library: datasets, library: pandas, license: CC BY 4.0, region: us, LLM Training, Synthetic Data, Synthetic (+1 more)

Self-Supervised Multimodal Time-Series Augmentation with Contrastive Adversarial

A research dataset for self-supervised learning on multimodal time-series data. It is designed for contrastive and adversarial augmentation techniques. The dataset's origin, size, and specific temporal coverage are not detailed in the provided metadata.

Tags: Time Series, Multimodal, Contrastive Learning, Self Supervised Learning, Research, Adversarial Learning, Multimodal Time Series, Data Augmentation (+1 more)

Multimodal Radiomic-Genomic Fusion via Graph-Augmented Deep Learning

Kaggle hosts a dataset for multimodal radiomic-genomic fusion via graph-augmented deep learning for early prediction. The dataset's specific content, size, and origin are unknown. It is categorized as research data.

Tags: Multimodal, Radiomics, Medical Imaging, Healthcare, Genomics, Research, Multimodal Fusion, Graph Learning (+1 more)

Colsmolvlm-instruct-500m-base: A 500 Million Parameter Instruction-Tuned Language Model

A language model dataset titled 'colsmolvlm-instruct-500m-base', published on Kaggle. The title suggests it is likely related to instruction tuning for a 500 million parameter language model. The dataset's specific content, size, and authorship are not detailed in the provided metadata.

Tags: Text, Text Generation, Language Model, LLM Training (+1 more)

Multimodal Concrete Crack Detection Dataset

Vision and audio data streams are categorized into structural crack classes for concrete integrity assessment. Synchronized multimodal inputs pair visual surface evidence with acoustic signatures to facilitate structural health monitoring research.

Tags: Image, Audio, Multimodal, Audio Classification, Image Classification (+1 more)

XD-Violence Single-Label Edition

Multimodal video and audio recordings categorized into single-label violence classes. This dataset provides synchronized visual and auditory data streams to support the development of automated violence detection models.

Tags: Video, Video Classification, Classification (+1 more)

vi_vqa_animal_dataset

90 animal species are categorized within this Vietnamese-language Visual Question Answering (VQA) dataset. The collection pairs images of animals with corresponding Vietnamese text questions and answers to facilitate multimodal learning.

Tags: Image, Text, Animals, Vietnamese, Deep Learning (+1 more)

Astral-Math-v1: Multi-Model TIR Mathematical Reasoning Dataset

A large-scale collection of mathematical problems categorized for Multi-Model TIR tasks. It provides structured data for training and evaluating reasoning-based mathematical solvers through multi-step logic.

Tags: English, Text, General Knowledge And Reasoning, Mathematics, Reinforcement Learning (+1 more)

Trendyol Cybersecurity Instruction Tuning Dataset with 53,202 Examples

53,202 instruction-tuning examples covering over 200 specialized cybersecurity domains, including cloud-native threats and AI/ML security. Created by the Trendyol Security Team for training defensive security AI assistants, this dataset was expanded from an initial 21,000 rows. The dataset was last updated on December 16, 2025.

Tags: Text, JSON, size category: 10K<n<100K, task category: text-generation, library: polars, task category: question-answering, language: en, Cybersecurity, modality: text, library: mlcroissant, Threat Intelligence, library: datasets, library: pandas, Security Operations, region: us, license: Apache 2.0, Incident Response, Defensive Security (+1 more)

NautData: 1.45 Million Underwater Image-Text Pairs for Instruction Tuning

NautData is a large-scale underwater instruction-following dataset containing 1.45 million image-text pairs. It was constructed by H-EmbodVis to bridge the gap in large-scale underwater multi-task instruction-tuning datasets. The dataset was introduced in the paper NAUTILUS and is intended for advancing underwater scene understanding methods.

Tags: Multimodal, Image Text Pairs, Underwater Imagery, Benchmark, Multimodal Training, Computer Vision, Large Scale, Instruction Following (+1 more)

VQA-SampleSub: Visual Question Answering Sample Subset

A sample subset of data for Visual Question Answering (VQA), a multimodal AI task. The dataset is hosted on Kaggle, but its specific size, origin, and update history are not detailed in the provided metadata. Content likely pairs images with corresponding questions and answer annotations.

Tags: Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering (+1 more)

Therascribe Gold 1M: Research-Backed Medical Vision-Language Dataset for Fine-Tuning

753,715 medical image-text pairs totaling 49.37 GB, designed for fine-tuning models like LLaVA-Med++. The dataset, created by Kafoo and last updated in November 2025, is stored in JSONL format alongside its images. Its captions are notably concise, averaging 1.0 words in length.

Tags: Multimodal, arXiv:2508.05019, arXiv:2511.16334, LLM Finetuning, Medical Vision Language, arXiv:2411.19688, Healthcare, Computer Vision, LLaVA-Med Finetuning, arXiv:2501.07171, region: us, arXiv:2306.00890, arXiv:2511.15994, Medical Images, arXiv:2503.02334, arXiv:2502.16841, Research Backed (+1 more)

Multimodal-VLM: Vision-Language Model Training Data

A dataset from Kaggle, likely containing paired image and text data for training or evaluating vision-language models. The specific content, scale, and creation details are not provided in the available metadata.

Tags: Multimodal, Vision Language, AI Training, VLM (+1 more)

Multimodal-VLM2: Vision-Language Model Training Data

A dataset titled 'multimodal-vlm2' hosted on Kaggle. The title suggests it contains data for training or evaluating Vision-Language Models, which typically integrate visual and textual information. The dataset's specific content, size, and origin are not detailed in the provided metadata.

Tags: Multimodal, Vision Language, AI Training, VLM (+1 more)
