DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

RL GSPO Qwen2.5VLM Staged Code: Reinforcement Learning for Vision-Language Models

This Kaggle dataset relates to reinforcement learning (RL) for the Qwen2.5 vision-language model (VLM). The title suggests it involves staged code, likely training procedures or generated outputs. The specific content, scale, and authorship require verification after download.

Tags: Multimodal, Vision Language Model, Code Generation, Reinforcement Learning, Staged Training, +1

SearchVLM: Vision-Language Model Search and Retrieval Data

SearchVLM is a dataset published on Kaggle. The title suggests it relates to vision-language models, likely containing data for search and retrieval tasks. Specific details on size, creator, and temporal coverage are not provided in the available metadata.

Tags: Multimodal, Vision Language Model, AI Benchmark, Multimodal Search, +1

CURATED_VLM_DATASETS_987486: A Collection for Vision-Language Model Training

CURATED_VLM_DATASETS_987486 is a dataset collection published on Kaggle. Its title suggests it contains data for training and evaluating Vision-Language Models. The specific contents, size, and origin are not detailed in the provided metadata.

Tags: Multimodal, Vision Language Models, Curated Datasets, Multimodal AI, +1

Cultural Performance and Tradition Reproduction Data for Heritage Preservation

Digital heritage data focuses on the preservation of cultural performances and traditions. The dataset's size, author, and last update date are not specified. It is hosted on the Kaggle platform.

Tags: Multimodal, Multimodal Data, Performance Art, Cultural Heritage, Tradition Preservation, +1

BLIP Captions Output: Image Captions Generated by a Vision-Language Model

Kaggle hosts this dataset titled 'blipcaptionsoutput'. The title suggests it contains image captions generated by the BLIP (Bootstrapping Language-Image Pre-training) model. The dataset's scale, origin, and specific content are not detailed in the provided metadata.

Tags: Multimodal, BLIP Model, Multimodal AI, Computer Vision, Image Captioning, +1

MedVQA-GI-2026: Gastrointestinal Medical Visual Question Answering

Kaggle hosts the MedVQA-GI-2026 dataset. It is a multimodal dataset for medical visual question answering, specifically focused on gastrointestinal topics. The dataset's author, organization, and specific scale are not provided in the metadata.

Tags: Multimodal, Multimodal QA, Vision Language, Medical AI, Medical VQA, Gastrointestinal, +1

Puffin-4M Multimodal Vision-Language-Camera Dataset

Puffin-4M is a large-scale, high-quality dataset containing 4 million samples for camera-centric multimodal understanding and generation. It integrates vision, language, and camera modalities to address the scarcity of benchmarks in spatial multimodal intelligence. The dataset was created by KangLiao and was last updated in January 2026.

Tags: WEBDATASET, Task Categories: image-to-3d, Task Categories: image-to-text, Spatial Intelligence, Library: webdataset, Task Categories: text-to-image, Modality: text, Size Categories: 1B<n<10B, Generation, Library: mlcroissant, Modality: image, Library: datasets, Unified Multimodal Model, Region: us, 3D Vision, Camera Centric, arXiv: 2510.08673, Task Categories: image-to-image, Understanding, +1

Instruction Following Prompts for Language Model Evaluation

Nemotron-RL-instruction_following combines prompts from the WildChat-1M dataset with verifiable instructions from the Open-Instruct code base. Created by NVIDIA, this dataset is designed for training and evaluating models on objective instruction adherence. It was last updated in January 2026.

Tags: Text, AI Training, Instruction Following, +1

TAOBAO-MM: Large-Scale E-Commerce User Interaction Sequences with Multimodal Embeddings

TAOBAO-MM is a large-scale recommendation dataset derived from user interaction logs on Taobao, one of the world's largest e-commerce platforms. It features historical behavior sequences of up to 1,000 interactions per user and includes high-quality multimodal embeddings. The dataset was authored by TaoBao-MM and was last updated on the Hugging Face platform on 2026-01-15.

Tags: Multimodal, Parquet, Library: polars, Library: dask, arXiv: 2512.07216, Recommendation, Size Categories: 10M<n<100M, E-Commerce, Modality: tabular, Library: mlcroissant, User Behavior, Library: datasets, Recommendation Systems, arXiv: 2407.19467, Region: us, Large Scale, Long Sequence, Multimodal Embeddings, License: apache-2.0, +1

ActionDetectionDatasetVLM: Video Data for Vision-Language Model Training

ActionDetectionDatasetVLM is a dataset published on Kaggle. Its title suggests it contains video data annotated for action detection tasks, likely intended for training or evaluating vision-language models. The dataset's specific content, size, and origin require verification after download.

Tags: Multimodal, Video Analysis, Vision Language Models, Action Detection, Computer Vision, +1

Synthvision Medical VQA: Synthetic Medical Images with Questions and Answers

Kaggle hosts the synthvision_medical_vqa dataset, which likely contains synthetic medical images paired with questions and answers for visual question answering tasks. The dataset's author, organization, and specific scale are unknown. Its last update date is also unspecified.

Tags: Multimodal, Medical Imaging, Medical Vision, Multimodal AI, Question Answering, AI Training, Visual Question Answering, +1

LLaVA-2: Vision-Language Dataset for Multimodal AI

LLaVA-2 is a dataset hosted on Kaggle, likely related to vision-language tasks and multimodal AI. Its specific content, scale, and creation details are not provided in the available metadata. The dataset appears to be intended for training or benchmarking large language models with visual capabilities.

Tags: Multimodal, Vision Language, Multimodal AI, LLM Training, +1

LLaVA-3: Vision-Language Model Training Data

Kaggle hosts the LLaVA-3 dataset, a resource for multimodal AI development. The dataset likely contains paired image and text data for training vision-language models. Its specific size, creator, and update history are not detailed in the provided metadata.

Tags: Multimodal, Vision Language, LLaVA, Multimodal AI, Image Captioning, LLM Training, +1

Spa3R Vision-Language Model Dataset

Spa3R VLM is a dataset for vision-language model tasks, hosted on Hugging Face by the author hustvl. The dataset was last updated on March 6, 2026.

Tags: Multimodal, Vision Language Model, Region: us, Benchmark, Computer Vision, Natural Language Processing, +1

SPRITE: Spatial Reasoning and Embodied Intelligence for VLMs

SPRITE is a spatial reasoning dataset for Vision-Language Models (VLMs) developed by zhihelu and released in early 2026. It provides image-text pairs designed to improve embodied intelligence by balancing linguistic diversity with computational precision, as detailed in arXiv paper 2512.16237.

Tags: arXiv: 2512.16237, Region: us, License: apache-2.0, +1

Mixbench2026: A Benchmark for Mixed Modality Retrieval

Four distinct subsets, including MSCOCO and VisualNews, provide multimodal queries and documents for cross-modal retrieval evaluation. The dataset uses queries.jsonl files to benchmark performance on text-only, image-only, and combined image-text search tasks.

Tags: Image, Multimodal, Parquet, Text, Size Categories: 10K<n<100K, Library: polars, Language: en, Task Categories: text-ranking, Modality: text, Library: mlcroissant, Task IDs: document-retrieval, Modality: image, Library: datasets, Benchmark, Library: pandas, Retrieval, Region: us, Multilinguality: monolingual, License: mit, +1
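
The Mixbench2026 entry mentions queries.jsonl files covering text-only, image-only, and combined image-text queries. As a rough illustration of how such a file might be tallied by query modality, here is a minimal Python sketch. The field names (`query_id`, `text`, `image`) and the sample records are assumptions made up for this example; the listing does not document the actual schema, so check the real files before relying on it.

```python
import json
from collections import Counter

# Hypothetical sample: the field names ("query_id", "text", "image") are
# assumptions for illustration, not the dataset's documented schema.
SAMPLE_JSONL = """\
{"query_id": "q1", "text": "a dog on a beach", "image": null}
{"query_id": "q2", "text": null, "image": "images/q2.jpg"}
{"query_id": "q3", "text": "news photo of a flood", "image": "images/q3.jpg"}
"""


def query_modality(record):
    """Classify a query record as text-only, image-only, or image-text."""
    has_text = bool(record.get("text"))
    has_image = bool(record.get("image"))
    if has_text and has_image:
        return "image-text"
    if has_text:
        return "text-only"
    if has_image:
        return "image-only"
    return "empty"


def count_modalities(lines):
    """Count queries per modality from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[query_modality(json.loads(line))] += 1
    return counts


counts = count_modalities(SAMPLE_JSONL.splitlines())
print(dict(counts))
```

To run this against a real file, pass an open file handle to `count_modalities` in place of the sample lines, after adjusting the field names to match the actual records.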

BLIP2-OPT-27B: A Vision-Language Model for Image-to-Text Tasks

BLIP2-OPT-27B is a large-scale vision-language model likely designed for tasks like image captioning and visual question answering. The dataset appears to be hosted on Kaggle, but its specific contents, such as training data or model weights, are not detailed in the provided metadata. Further inspection is required to confirm the exact data format and scope.

Tags: Multimodal, Vision Language, Multimodal AI, Image Captioning, BLIP2, OPT 27b, +1

Data Nanovlm: A Multimodal AI Benchmark Dataset

Data Nanovlm is a dataset published on the Hugging Face platform by the author LMMs-Lab-Speedrun. The dataset was last updated on February 27, 2026. Its specific content and scale are not detailed in the available metadata.

Tags: Multimodal, Machine Learning, Benchmark Data, Vision Language, Multimodal AI, +1

WavLM-Large: A Large-Scale Speech Representation Model

WavLM-Large is a model for speech representation learning, published on Kaggle. The dataset's specific content, size, and origin require verification after download.

Tags: Audio, Machine Learning, Audio Model, Speech Processing, +1

LIMO_VQA: A Visual Question Answering Dataset

LIMO_VQA is a dataset for Visual Question Answering (VQA) tasks, likely containing pairs of images and associated questions. The dataset is hosted on Kaggle, a popular platform for data science competitions and projects. Specific details on its size, creation date, and authors are not provided in the available metadata.

Tags: Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering, +1

Page 54 of 98