DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Face-Iris Synthetic Data for Multimodal Biometric Testing

Synthetic face–iris dataset designed for multimodal biometric research and testing. The dataset's author, size, and specific creation details are not provided. Its last update date and licensing terms are also unknown.

Image, Multimodal, 🌍 Global, Face Recognition, Feature Extraction, Biometrics, Classification, Synthetic Data, Synthetic

CCTV-Pedestrian-1K: High-Angle Surveillance Images for Person Attribute Recognition

CCTV-Pedestrian-1K is a dataset of high-angle surveillance pedestrian images intended for training Vision Transformers (ViT) and Vision-Language Models (VLM). The dataset is hosted on Kaggle and is tagged for applications in public safety and computer vision. Specific details on the number of images, collection time, and creator are not provided in the available metadata.

Image, Multimodal, Surveillance, Eyes And Vision, Computer Vision, Transformers, Pedestrian Detection, Public Safety, Vision Transformer, Person Attributes

HeartCycle: Multimodal Cardiac Data with ECG and PPG Signals

Multimodal cardiac data integrates electrocardiogram (ECG), photoplethysmogram (PPG), and cardiac timing features. The dataset is hosted on Kaggle and is associated with platform tags for biology, signal processing, and medicine. Specific details on size, origin, and update frequency are not provided in the available metadata.

Multimodal, Medicine, ECG, PPG, Biology, Cardiac Physiology, Signal Processing, Health

Multimodal Traditional Chinese Medicine Knowledge Dataset

Aligned text, image, and audio data for cross-language AI translation tasks in Traditional Chinese Medicine (TCM). The dataset is hosted on Kaggle and is tagged as suitable for beginners. Its author, organization, and specific size are unknown.

Audio, Multimodal, AI Translation, Traditional Chinese Medicine, Computer Vision, Beginner, Cross Language

MGI-TED: Toddler Multimodal Features for Development Analysis

MGI-TED provides multimodal features for analyzing toddler development and learning behavior. The dataset's author, organization, and specific scale are currently unknown. It is hosted on Kaggle, but details on its collection method and temporal coverage are not provided.

Multimodal, Child Development, Multimodal Data, Learning Behavior

Longtimescope: Long-Video Data for Multimodal Model Exploration

Longtimescope is a dataset for exploring long-video understanding with large multimodal models, as referenced in the Apollo2 research paper. The dataset was created by the Apollo-LMMs team and was last updated on the Hugging Face platform in January 2026. Its specific size, format, and content details are not provided in the available metadata.

Video, Multimodal, Long Video, Video Understanding, Large Language Models

Nemotron-VLM-Dataset V2: 9 Million Vision-Language Reasoning Samples

NVIDIA released this collection of approximately 9 million vision-language samples in late 2025. It focuses on document understanding, visual question answering, and video-to-text tasks across multiple languages.

Multilingual, JSON, Task Categories: image-text-to-text, Document Understanding, Library: polars, Size Categories: 1M<n<10M, Vision Language Model, Task Categories: visual-question-answering, Modality: text, Library: mlcroissant, Task Categories: video-text-to-text, Library: datasets, Library: pandas, License: CC BY 4.0, arXiv: 2511.03929, Region: US

S-Chain: Structured Visual Chain-of-Thought for Multilingual Medicine

S-Chain is a multimodal medical dataset developed by Khai Le-Duc and a multi-institutional research team, last updated in December 2025. It provides structured visual chain-of-thought reasoning paths for clinical tasks across eight languages, including English, Arabic, and Japanese. The data supports a wide range of tasks from object detection to multilingual text generation.

Task Categories: text-generation, Language: ar, Task Categories: multiple-choice, Task Categories: question-answering, Task Categories: object-detection, Language: en, Task Categories: visual-question-answering, Language: hi, Language: id, Task Categories: zero-shot-image-classification, Task Categories: feature-extraction, Task Categories: image-classification, Task Categories: zero-shot-classification, Language: ko, Language: fr, Language: ja, Task Categories: text-classification, Task Categories: translation, Task Categories: zero-shot-object-detection, Language: de

WildfireVLM: A Multimodal Benchmark for Wildfire Analysis

WildfireVLM is a dataset hosted on Kaggle, likely focused on visual and language modeling for wildfire events. The platform tags suggest it contains geospatial and computer vision data, potentially for benchmarking deep learning models. Its specific content, size, and creation details require verification after download.

Geospatial, Multimodal, Computer Science, Benchmark, Computer Vision, Data Analytics, Deep Learning, Wildfire

HAIM Multimodal Full Dataset

HAIM Multimodal Full Dataset is hosted on Kaggle. The dataset's specific content, size, and creation details are not provided in the available metadata. Its title suggests it contains multiple data modalities, likely for machine learning research.

Multimodal, Machine Learning, AI Research

VQA Animal Dataset: Visual Question Answering for 90 Animal Species

Image and text question-answer pairs representing 90 distinct animal species. It provides structured data for Visual Question Answering (VQA) tasks, focusing on the identification and description of fauna.

English, Animals, Question Answering, Deep Learning

Synthetic EHR Dataset with Text and Vital Signs

A synthetic electronic health record dataset integrating text notes and time-series vital sign data. The dataset is designed for healthcare predictive research (HPR). It was published on Kaggle by an unknown author, with no information on its size or last update.

Time Series, Multimodal, Healthcare, Clinical Data, Synthetic Data, Synthetic

HADES-VLM-Data: Vision-Language Model Training Dataset

HADES-VLM-Data is a dataset for training vision-language models, published on Kaggle. The dataset's specific content, size, and creation details are not described in the available metadata. Its intended use likely involves aligning visual and textual information for AI model development.

Multimodal, Vision Language Model, Multimodal Data, AI Training

Student Engagement Multimodal Dataset

A multimodal dataset focused on student engagement, published on Kaggle. The dataset likely contains multiple data types such as video, audio, or sensor readings to capture behavioral and interaction patterns. Specific details on volume, collection method, and authorship are not provided in the available metadata.

Multimodal, Student Engagement, Behavioral Analysis, Multimodal Data, Education Research

Camelyon16 Uni: Patch Embeddings for Histopathology Images

Patch embeddings for the CAMELYON16 dataset generated using the UNI foundation model. The embeddings are derived from 128x128 micrometer tissue patches, with segmentation and patching performed using a modified version of the CLAM toolkit. The dataset was authored by kaczmarj and last updated on December 10, 2025.

Image, Multimodal, Foundation Model, Medical Imaging, Computer Vision, Histopathology, Patch Embeddings

Myanmar Language Question-Answer Pairs for Instruction Tuning

A collection of question-answer pairs in the Myanmar language designed for instruction tuning of Large Language Models. The dataset aggregates content from multiple sources covering domains like agriculture, health, microbiology, general knowledge, and Buddhism. It was created by chuuhtetnaing and last updated on Hugging Face in December 2025.

Text, Myanmar Language, Question Answering, Healthcare, LLM Training

JMMMU-Pro: Japanese Multimodal Understanding Benchmark via Vibe Benchmark Construction

JMMMU-Pro is an image-based Japanese multi-discipline multimodal understanding benchmark. It extends the JMMMU benchmark by composing question images and text into a single image, requiring integrated visual-textual understanding. The dataset was created by JMMMU and last updated on Hugging Face in December 2025.

Multimodal, Benchmark, Visual Understanding, Computer Vision, Text Understanding, Japanese Language, Multimodal Benchmark

Spectral Understanding and Visual Question Answering Benchmark

SpecVQA is a benchmark dataset for evaluating Multimodal Large Language Models on spectral understanding and visual question answering tasks using scientific images. The dataset is authored by UniParser and was last updated in December 2025. It contains images and text data, with specific row and column counts unknown.

OPTIMIZED-PARQUET, Parquet, Size Categories: 1K<n<10K, Library: polars, Language: zh, Language: en, Task Categories: visual-question-answering, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Library: pandas, License: CC BY-NC 4.0, Region: US

AQI Multimodal Dataset: Air Quality Index Data with Multiple Modalities

The AQI Multimodal Dataset is a collection of data related to air quality, likely containing measurements from various sources. The dataset is hosted on Kaggle, but specific details about its size, origin, and creation date are not provided in the available metadata. Further verification is required to confirm the exact contents, scale, and authorship.

Multimodal, Air Quality, Environmental Sensing

Multimodal Coding Dataset: 598k Samples for HTML, Chart, and Algorithmic Code Generation

598,000 high-quality samples for training and evaluating multimodal code generation models. The dataset covers HTML generation, chart-to-code, image-augmented QA, and algorithmic problems, supporting research in unifying vision-language understanding with code generation. It was created by author 'lingjie23' and last updated on December 24, 2025.

Multimodal, Vision Language, Computer Vision, Algorithmic Problems, Chart To Code, Large Scale, Multimodal Code Generation
