DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

LLaVA Model: Vision-Language Model Weights

Kaggle hosts the LLaVA model, a multimodal AI system. The dataset likely contains model weights for a large language model with vision capabilities. The author, organization, and last update date are unknown.

Multimodal, Vision Language Model, Multimodal AI, Large Language Model, Model Weights, +1

VQA-Autopilot: Visual Question Answering Dataset for Autonomous Systems

The VQA-Autopilot dataset is hosted on Kaggle. Its title suggests it contains data for visual question answering, a task combining computer vision and natural language processing, potentially for applications in autonomous systems. Metadata is minimal; actual content, scale, and authorship require verification after download.

Multimodal, Computer Vision, Natural Language Processing, Autonomous Systems, Visual Question Answering, +1

LLaVA Dataset: Vision-Language Data for Multimodal AI Training

A dataset named LLaVA, hosted on Kaggle, likely contains multimodal data for training vision-language models. The platform tags suggest it is intended for large language model (LLM) training and multimodal AI tasks. Specific details on size, structure, and creation are not provided in the available metadata.

Multimodal, Vision Language, Multimodal AI, LLM Training, +1

data_vlm_diff_ready_30: Vision-Language Model Training Data

A dataset titled 'data_vlm_diff_ready_30' is hosted on Kaggle. The title suggests it is prepared for training or evaluating vision-language models, likely containing paired image and text data. Its specific content, size, and creation details are not provided in the available metadata.

Multimodal, Vision Language Model, Multimodal Data, Computer Vision, Natural Language Processing, +1

Adversarial Multimodal Test Cases for LLM Validation

Adversarial test cases combine images and text to validate multimodal large language models. The dataset is designed to challenge evidence-based reasoning capabilities in models like Gemini. Its origin, size, and creation details are not specified.

Multimodal, Image Text Pairs, Model Validation, Adversarial Testing, +1

Multimodal Earnings Conference Call Data for Financial Analysis

This multimodal dataset likely contains transcripts, and potentially audio recordings, from corporate earnings conference calls. It is hosted on Kaggle, but specific details about its size, source, and creation date are not provided in the available metadata. Its content suggests it is intended for analyzing corporate financial performance and communication.

Multimodal, Corporate Communications, Multimodal Data, Earnings Calls, Financial Analysis, +1

WBC-AttrDescVQA: Visual Question Answering on White Blood Cell Images

WBC-AttrDescVQA is a dataset for Visual Question Answering (VQA) tasks, likely involving images of white blood cells (WBCs). The dataset is hosted on Kaggle, but its specific scale, creation date, and authorship are not detailed in the provided metadata. Its content and structure must be verified after download.

Multimodal, Medical Imaging, Multimodal AI, White Blood Cells, Visual Question Answering, +1

WildFire_VQA: Visual Question Answering for Wildfire Scenes

A dataset titled 'WildFire_VQA' is hosted on Kaggle. The dataset likely contains image and text pairs for visual question answering tasks related to wildfire scenes. Metadata is minimal; the specific number of samples, data source, and creation date are unknown.

Multimodal, Satellite Imagery, Visual Question Answering, Wildfire, Disaster Response, +1

Image Captions for Nova 2

'Image captions for nova 2' is a dataset published on Kaggle. The title suggests it contains descriptive text paired with images. Metadata is minimal; actual content requires verification after download.

Multimodal, Image Captions, Computer Vision, +1

Chart2Code: 2,023 Hierarchical Tasks for Chart-to-Code Generation

Chart2Code is a benchmark of 2,023 tasks designed to evaluate Large Multimodal Models (LMMs) on chart understanding and code generation, released by CSU-JPG in 2026. The dataset is structured into three hierarchical difficulty levels containing 863, 1,010, and 150 tasks respectively. It maps visual data visualizations to executable code to test the reasoning capabilities of multimodal systems.

Multimodal, Task Categories: image-text-to-text, Language: en, Chart Understanding, LLM Evaluation, Code Generation, Charts, Region: us, arXiv: 2510.17932, +1

RL GSPO Qwen2.5VLM PhaseB Best Composite 180: Vision-Language Model Benchmark

RL GSPO Qwen2.5VLM PhaseB Best Composite 180 is a dataset published on Kaggle. The title suggests it is likely a benchmark or evaluation dataset for a vision-language model, possibly related to reinforcement learning. The dataset's specific content, size, and origin are unknown from the provided metadata.

Multimodal, Vision Language Model, Benchmark, AI Training, Reinforcement Learning, +1

RL GSPO Qwen2.5VLM Staged Code V2: Reinforcement Learning Dataset

RL GSPO Qwen2.5VLM Staged Code V2 is a dataset hosted on Kaggle. The title suggests it relates to reinforcement learning (RL) and staged training for a vision-language model (VLM) named Qwen2.5. The dataset likely contains data used for training or evaluating such models.

Multimodal, Vision Language Model, Code Generation, Reinforcement Learning, Staged Training, +1

VLMs Are Biased: Vision Language Model Performance on Counting Tasks

This academic dataset from KAIST, William and Mary, University of Alberta, and Auburn University was released in December 2025. It demonstrates a performance gap in state-of-the-art Vision Language Models (VLMs), which perform perfectly on counting tasks with original images but fail catastrophically on modified versions. The dataset is hosted on Hugging Face by author anvo25.

Multimodal, AI Evaluation, Vision Language Models, Model Bias, Computer Vision, +1

Orient Anything V2: Training Renderings for 3D Object Orientation

Orient Anything V2 is an enhanced foundation model for unified understanding of object 3D orientation and rotation from single or paired images. This repository contains the final rendering data used for training the model, as provided by author Viglong. The dataset was last updated on January 13, 2026.

Image, Point Cloud, Multimodal, WEBDATASET, Modality: 3d, 3d-understanding, Size Categories: 1M<n<10M, Library: webdataset, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, 3D Orientation, License: CC BY 4.0, Computer Vision, Task Categories: other, Region: us, arXiv: 2601.05573, Image Pairs, Orientation Estimation, +1

VQA_VIDEOS: Visual Question Answering Video Dataset

VQA_VIDEOS is a dataset hosted on Kaggle. The title suggests it contains video content paired with questions and answers for visual question answering tasks. The dataset's specific size, content details, and origin are not provided in the available metadata.

Video, Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering, +1

Multimodal Emotion Dialogue Records with Speech and Image Data

Multimodal Emotion Dialogue Dataset is a collection of records for analyzing emotional states in conversations. The dataset likely contains speech, image, and interaction data, as indicated by its raw description. It is hosted on Kaggle, but specific details on its size, creation, and update history are not provided.

Audio, Multimodal, Dialogue Systems, Multimodal Data, Emotion Recognition, Computer Vision, Human Computer Interaction, +1

BlipCaptioningOutput: Image Captioning Data from BLIP Model

Kaggle hosts this dataset titled 'BlipCaptioningOutput'. The title suggests it contains outputs from the BLIP (Bootstrapping Language-Image Pre-training) model, likely pairing images with generated or ground-truth captions. No further metadata on size, source, or creation date is provided.

Multimodal, BLIP Model, Computer Vision, Image Captioning, Natural Language Processing, +1

SciCap Scientific Image and Caption Dataset

SciCap Dataset provides pairs of scientific images with corresponding captions. It is designed for training and evaluating multimodal models. The dataset was created for research in scientific image understanding.

Multimodal, Machine Learning, Multimodal Data, Computer Vision, Image Captioning, Scientific Images, +1

WavLM Region Model: Audio Feature Representations

A dataset titled 'wavlm_region_model' is hosted on Kaggle. The dataset likely contains audio feature representations or model outputs from the WavLM architecture. Metadata is minimal; actual content, size, and structure require verification after download.

Audio, Multimodal, Machine Learning, Speech Model, Audio Processing, +1

WavLM Gender Model: Audio Data for Gender Classification

A dataset titled 'wavlm_gender_model' is hosted on Kaggle. Its content likely relates to audio data processed by the WavLM architecture for gender classification tasks. Metadata is minimal; the specific number of samples, audio characteristics, and creation details require verification after download.

Audio, Machine Learning, Speech Analysis, Audio Processing, +1

Page 53 of 98