DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Supervised Fine-Tuning and Instruction-Following Data for LLMs

Instruction tuning data for large language models, sourced from Kaggle. The dataset's specific size, format, and content details are not provided in the metadata. Its primary purpose is to support the supervised fine-tuning process for aligning model outputs with human instructions.

Text, Fine Tuning, Large Language Model, Supervised Learning, +1

Trained-VLM-Config: Vision-Language Model Configuration Files

Trained-vlm-config is a dataset hosted on Kaggle. The title suggests it contains configuration files or parameters for a trained Vision-Language Model. The dataset's specific contents, scale, and authorship are not detailed in the provided metadata.

Multimodal, Machine Learning, Vision Language Model, Model Configuration, +1

Unified Multimodal Emotion Dataset for Affective Computing

A multimodal dataset focused on emotion recognition, published on Kaggle. The dataset likely contains data from multiple modalities such as text, audio, or images, aligned for emotion analysis. Specific details on volume, collection method, and authorship are not provided in the available metadata.

Multimodal, Pre Trained Model, Affective Computing, Emotion Recognition, +1

Multimodal Emotion Recognition Dataset

A dataset for emotion recognition, likely containing multiple data modalities such as text, audio, or images. It is hosted on Kaggle and may be associated with a pre-trained model. The specific volume, source, and creation date are not detailed in the available metadata.

Multimodal, Pre Trained Model, Affective Computing, Emotion Recognition, +1

my_vqa_dataset12: Visual Question Answering Dataset

my_vqa_dataset12 is a Visual Question Answering (VQA) dataset published on Kaggle. Its specific content, size, and authorship are unknown.

Multimodal, Computer Vision, Natural Language Processing, Visual Question Answering, +1

Universe Multimodal Cleaned Dataset

Universe_multimodal_cleaned is a dataset published on Kaggle. The title suggests it contains cleaned, multimodal data, likely combining multiple data types such as text, images, or audio. Specific details on its size, origin, and creation date are not provided in the available metadata.

Multimodal, Machine Learning, Cleaned Data, +1

ORCA RLHF: Reinforcement Learning from Human Feedback Data

ORCA RLHF is a dataset hosted on Kaggle, likely related to training large language models using reinforcement learning from human feedback. The dataset's specific content, size, and structure are not detailed in the provided metadata. Its origin and creation methodology are also unspecified.

Text, Reinforcement Learning, Large Language Models, Human Feedback, +1

Multimodal Emotions Dataset

A Kaggle-hosted dataset focused on emotions. The dataset likely contains multimodal data, such as text, audio, or images, related to emotional states. Its specific content, size, and creation details are not provided.

Multimodal, Pre Trained Model, Emotion, +1

BLIP Model: Vision-Language Pre-training Data or Weights

BLIP Model is a dataset or model artifact related to the BLIP (Bootstrapping Language-Image Pre-training) framework, hosted on Kaggle. The specific content, such as pre-training data, model weights, or fine-tuning examples, is not detailed in the available metadata. Its origin and creation date are unknown.

Multimodal, Vision Language, Multimodal AI, Computer Vision, Image Captioning, +1

Multimodal Dataset LessIsMore

Multimodal-dataset-lessismore is a dataset hosted on Kaggle. Its title suggests it contains multiple data types, such as images, text, or audio, combined for machine learning tasks. The dataset's specific content, scale, and origin are not detailed in the available metadata.

Multimodal, Machine Learning, AI Training, +1

Agri VLM Dataset: Agricultural Vision-Language Data

Agri VLM Dataset is a multimodal dataset likely containing agricultural imagery paired with textual descriptions, sourced from Kaggle. The dataset's specific size, content details, and creation date are not provided in the available metadata. Its purpose appears to be for training and evaluating vision-language models on agricultural concepts.

Multimodal, Vision Language Model, Computer Vision, Agriculture, Natural Language Processing, +1

VQA Model Scripts for Visual Question Answering

Kaggle hosts this collection of scripts and resources for a Visual Question Answering (VQA) model. The dataset's specific content, size, and authorship are not detailed in the provided metadata. It is categorized on the platform as a 'Pre Trained Model' resource.

Multimodal, Pre Trained Model, Computer Vision, Natural Language Processing, VQA, +1

ArSyra: Arabic Instruction Tuning Dataset for LLMs

Instruction tuning data for fine-tuning large language models on Arabic language tasks. The dataset is hosted on Kaggle, but its specific size, creation date, and authorship are not provided in the available metadata. Columns and sample data are unknown, limiting immediate assessment of its content and structure.

Text, Fine Tuning, Arabic NLP, Large Language Models, +1

MedVLM-Src: Source Data for Medical Vision-Language Models

MedVLM-Src is a dataset published on Kaggle. The title suggests it contains source data for training or evaluating medical vision-language models. The dataset's specific content, scale, and origin require verification after download.

Multimodal, Medical Imaging, Vision Language Models, Multimodal AI, Clinical Text, +1

Multimodal Data in CSV Format

A dataset titled 'Multimodal_csv' is available on Kaggle. The dataset's specific content, size, and origin are not detailed in the provided metadata. Further verification is required to confirm the exact nature and composition of the multimodal elements.

Tabular, Multimodal, Machine Learning, +1

ttv_sp_llava_final: Multimodal Vision-Language Data for AI

A dataset titled 'ttv_sp_llava_final' published on Kaggle. The title suggests it is a final version of data related to the LLaVA (Large Language-and-Vision Assistant) model, likely containing multimodal content for vision-language tasks. Metadata is minimal; the specific content, size, and origin require verification after download.

Multimodal, Vision Language, LLaVA, Multimodal AI, +1

GVLM-Data: A Vision-Language Dataset for AI Training

GVLM-Data is a dataset hosted on Kaggle. The dataset's title suggests it is likely related to General Vision-Language Models. Its specific content, size, and origin are not detailed in the available metadata.

Multimodal, Vision Language, AI Training, GVLM, +1

MicroVQA++: A Large-Scale, High-Quality Microscopy Visual Question Answering Dataset

MicroVQA++ is a large-scale, high-quality microscopy visual question answering corpus, built in three stages from biomedical imaging sources. Created by author ieellee and last updated on 2025-12-14, it is designed to address the scarcity of training data for scientific reasoning in microscopy with multimodal large language models.

Multimodal, optimized-parquet, Parquet, Library: polars, Size Categories: n<1K, Modality: text, Weakly Supervised, Biomedical Imaging, Library: mlcroissant, Modality: image, Multimodal LLM, License: CC BY-SA 4.0, arXiv: 2511.11407, Library: datasets, Library: pandas, Microscopy, Region: us, Large Scale, Natural Language Processing, Visual Question Answering, +1

Multimodal Voice and Demographic Features for Diabetes Prediction, 1600 Samples

The dataset pairs 1,600 voice samples with 304 voice features and 4 demographic variables for diabetes prediction. It is hosted on Kaggle, and its platform tags suggest a focus on deep learning applications and synthetic data. It is multimodal in that it combines features from audio signal processing with demographic information.

Tabular, Audio, Multimodal, Asia, Demographic Data, Voice Features, Diabetes, Multimodal Health, Diabetes Prediction, Deep Learning, Synthetic, +1

PixelProse: 16.9 Million Dense Synthetic Image Captions

PixelProse contains 16,896,214 image-caption pairs featuring dense synthetic descriptions generated by Gemini 1.0 Pro Vision. Released in 2024 by researchers at the University of Maryland (tomg-group-umd), the collection provides detailed textual representations for images sourced from CommonPool and CC12M.

Parquet, Library: polars, Task Categories: image-to-text, Library: dask, Language: en, Task Categories: visual-question-answering, Size Categories: 10M<n<100M, Task Categories: text-to-image, Modality: text, Modality: tabular, Library: mlcroissant, Modality: image, Library: datasets, License: CC BY 4.0, DOI: 10.57967/hf/2892, Croissant, Region: us, arXiv: 2406.10328, +1
