DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Audio Embedding WavLM Large: Precomputed Audio Features

Audio embeddings generated by the WavLM-Large model, a transformer-based architecture for audio representation learning. The dataset likely contains precomputed feature vectors for audio samples, facilitating downstream machine learning tasks. It is hosted on Kaggle, a platform for data science competitions and datasets.

Audio, Machine Learning, Feature Extraction, Audio Embeddings, +1

SORDI Syn2Real: Vision-Language Model Benchmark

A benchmark dataset for evaluating vision-language models, likely focusing on synthetic-to-real transfer. The dataset is hosted on Kaggle and is tagged as a benchmark. Specific details regarding size, columns, and creation date are unknown.

Multimodal, Vision Language, Benchmark, Computer Vision, Syn2real, +1

Chest X-Ray Images and Associated Medical Reports

A multimodal dataset containing chest X-ray (CXR) images and their corresponding textual radiology reports, intended for training vision-language models. The specific volume of image-report pairs, creation date, and original author are not specified in the provided metadata. It is identified as a processed subset, version 2, sourced from Kaggle.

Multimodal, Vision Language Model, Medical Imaging, Radiology Reports, +1

Chest X-Ray Images with Corresponding Radiology Reports

Chest X-ray images and their associated radiology reports are provided for training vision-language models. The dataset is a processed subset, indicating curation for machine learning tasks. The creator and specific volume of data are not specified.

Multimodal, Vision Language Model, Medical Imaging, Medical Reports, Radiology, +1

PulseMind: MediScope Multimodal Medical Dataset

1,000 curated multimodal medical cases featuring paired medical images and structured JSON annotations. The data is formatted to support vision-language understanding and medical question-answering tasks through its integrated image-text architecture.

Parquet, Size Categories: 1K<n<10K, Library: polars, Language: zh, Library: dask, Modality: text, Library: mlcroissant, Library: datasets, Region: us, License: mit, Medical, +1

WangVQA Visual Question Answering Dataset

WangVQA is a dataset for visual question answering tasks, likely containing paired images and textual questions with answers. The dataset's creator and specific size are not documented in the provided metadata. Its release date and update frequency are also unknown.

Multimodal, Multimodal Learning, Image Text, Computer Vision, Natural Language Processing, Visual Question Answering, +1

OctoBench

OctoBench is an instruction-following benchmark for coding agents, created by MiniMaxAI and released in January 2026. It is an extended version of OctoCodingBench, expanded from 72 manually annotated instances to 217 instances using AI-assisted augmentation. The dataset is hosted on Hugging Face and is intended for evaluating agentic coding performance.

Text, Task Categories: text-generation, Language: en, arXiv: 2601.10343, Size Categories: n<1K, Agent Evaluation, Code, Evaluation, AI Benchmark, Benchmark, Code Generation, Region: us, Agent, License: mit, +1

WavLM-Base: A Pre-Trained Speech Representation Model

WavLM-Base is a pre-trained model for speech representation learning. It was published on the Kaggle platform, but detailed information about its training data, architecture specifics, and performance benchmarks is not provided in the available metadata. The dataset likely contains the model weights and configuration files necessary for inference or fine-tuning.

Audio, Machine Learning, Pre Trained Model, Speech Processing, Audio Representation, +1

RSVLM-SFT: Instruction-Tuning Pairs for Remote Sensing Vision-Language Models

RSVLM-SFT is a remote sensing instruction-tuning dataset released by FelixKAI in 2026 for training the MF-RSVLM vision-language model. It contains image-text pairs for modality alignment and instruction following, although the specific record count is not disclosed in the metadata.

arXiv: 2512.24022, Region: us, License: apache-2.0, +1

RSVLM SFT: Remote Sensing Vision-Language Model Training Data

MF-RSVLM is a remote sensing vision-language model (VLM) combining a CLIP vision encoder and a Vicuna-7B language model. The model was trained in two stages for modality alignment and instruction following. The dataset is associated with the FUSE-RSVLM project and was uploaded by RL-MIND.

Geospatial, Multimodal, Vision Language Model, arXiv: 2512.24022, Satellite Imagery, Modality: image, Computer Vision, Region: us, License: apache-2.0, +1

Data Management Plan for Multimodal AI Health Research

This entry is a Data Management and Sharing Plan that outlines the strategy for handling scientific data generated for a research project on ethical, multimodal AI in health. The plan describes the types of data to be used and the framework for managing and sharing them. Specific details on data volume, structure, and features are not provided.

Data Management Plan for Trustworthy Scientific Foundation Models

This entry is a Data Management and Sharing Plan (DMS Plan) authored by Tianlong Chen that outlines the strategy for managing and sharing scientific data generated for research on trustworthy, domain-informed scientific foundation models. The plan describes the scientific data to be used and generated but does not contain the actual dataset. Specific details on data volume, structure, and features are not provided.

NEXUS: Temporal Hierarchical Multimodal Video Slices for Streaming Training

NEXUS is a multi-modal, hierarchical, temporal representation derived from the HuggingFaceFV/finevideo dataset. The primary unit is a 10-millisecond 'slice' that aggregates into moments (100 ms), seconds (1 s), experiences (10 s), and minutes (60 s). It was created by Ardea and last updated on 2025-12-29.

Time Series, Video, Multimodal, Neural Evolution, Video Streaming, Temporal Hierarchical, +1
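
The temporal hierarchy described for NEXUS can be illustrated with a small sketch. The level names and durations below are taken from the description above; the constant `LEVELS_MS` and the helpers `locate` and `slices_per` are hypothetical names introduced only for illustration and are not part of the dataset or any published tooling. The sketch also assumes that each level is a fixed-width, non-overlapping window aligned to time zero, which the listing does not confirm.

```python
# Illustrative only: durations come from the NEXUS listing; all names here are
# hypothetical, and fixed, zero-aligned windows are an assumption.
LEVELS_MS = {
    "slice": 10,          # primary unit
    "moment": 100,
    "second": 1_000,
    "experience": 10_000,
    "minute": 60_000,
}


def locate(timestamp_ms: int) -> dict[str, int]:
    """Return the zero-based index of the unit containing timestamp_ms at each level."""
    if timestamp_ms < 0:
        raise ValueError("timestamp_ms must be non-negative")
    return {name: timestamp_ms // width for name, width in LEVELS_MS.items()}


def slices_per(level: str) -> int:
    """Return how many 10 ms slices make up one unit of the given level."""
    return LEVELS_MS[level] // LEVELS_MS["slice"]
```

Under these assumptions, `locate(12_345)` places a timestamp 12.345 s into a video in slice 1234, moment 123, second 12, experience 1, and minute 0, and `slices_per("minute")` gives 6000 slices per minute.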

TimeLens-100K: A Large-Scale Video Temporal Grounding Dataset

TimeLens-100K is a large-scale training dataset for video temporal grounding, created by TencentARC. The dataset was proposed in the paper 'TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs' and annotated using an automated pipeline powered by Gemini-2.5-Pro. It was last updated on December 19, 2025.

Time Series, Video, Multimodal, Video Annotation, Multimodal LLM, Computer Vision, Large Scale, Video Temporal Grounding, +1

Cultural Relics with Images, 3D Models, and Text Data

A multimodal collection of images and 3D representations of cultural relics, paired with textual descriptions. The dataset's creator, size, and update date are not specified.

Multimodal, Art History, Computer Vision, Cultural Heritage, +1

BLIP_Captions: Image Captioning Dataset for Vision-Language Models

A dataset likely containing images paired with descriptive text captions, sourced from Kaggle. The dataset's title suggests it is related to the BLIP (Bootstrapping Language-Image Pre-training) model, a vision-language framework. Specific details on volume, creation date, and authorship are unavailable from the provided metadata.

Multimodal, Multimodal AI, Computer Vision, Image Captioning, +1

A Multimodal Approach Dataset from Kaggle

Kaggle hosts a dataset titled 'A Multimodal Approach'. The dataset's specific content, size, and creator are not detailed in the provided metadata. Its title suggests it likely contains data from multiple modalities, such as text, images, or audio, integrated for analysis.

Multimodal, Machine Learning, Multimodal Data, AI Research, +1

A Multimodal Approach Dataset from Kaggle

A Multimodal Approach is a dataset hosted on Kaggle. Its specific content, size, and origin are not detailed in the provided metadata. The dataset likely contains multiple data types, such as text, images, or audio, aligned for multimodal machine learning tasks.

Multimodal, Machine Learning, AI Research, +1

Forensic-RS-VQA: Visual Question Answering for Forensic Analysis

Forensic-RS-VQA is a dataset published on Kaggle for forensic analysis using visual question answering. The dataset likely contains multimodal data, such as images paired with textual questions and answers, for reasoning tasks. Specific details on volume, authorship, and update history are not provided in the available metadata.

Multimodal, Forensic Analysis, Visual Question Answering, +1

Hotel Customer Preference Data for Tourism Competitiveness

This dataset provides customer insights related to hotel competitiveness. The data appears to focus on tourism preferences and hotel selection. The author, organization, and specific data volume are unknown.

Tabular, Tourism, Customer Insights, Hospitality, Hotel Preference, +1
