DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,948 datasets

KORE-74K Image Recognition and Captioning Dataset

KORE-74K is a multimodal dataset containing over 74,000 training entries for image recognition, captioning, and visual question answering tasks. It was created by kailinjiang and published in 2026, building on the MMEVOKE dataset. The data includes separate archives for recognition/caption images and VQA images, paired with structured JSON annotations.

Tags: Multimodal, Task Categories: image-text-to-text, Multimodal AI, arXiv:2510.19316, Computer Vision, Image Captioning, Region: US, Recognition Data, Visual Question Answering (+1 more)

ChartVerse-RL-40K: Challenging Chart Reasoning Samples for Reinforcement Learning

ChartVerse-RL-40K is a curated dataset of the most challenging chart reasoning samples for Reinforcement Learning, developed by opendatalab. It contains samples with the highest failure rates, which strong Vision-Language Models struggle with but can still solve occasionally, providing a strong learning signal for RL training. The dataset was last updated on 2026-01-21.

Tags: Multimodal, Difficult Samples, Vision Language Models, Chart Reasoning, Reinforcement Learning (+1 more)

Unified Prompt Guard: 287,303 Samples for LLM Jailbreak and Harmful Input Detection

ynyg's Unified-Prompt-Guard dataset, last updated January 2026, is a text dataset for training binary classifiers to defend against LLM jailbreak attacks and unsafe prompts. It contains 265,589 training, 10,857 validation, and 10,857 test samples (287,303 in total), synthesized from three sources: jailbreak-detection-dataset, Nemotron-Safety-Guard-Dataset-v3 (zh), and PKU-SafeRLHF.

Tags: Text, Prompt Safety, Text Classification, LLM Security, Harmful Content (+1 more)

WAVLM_Age(VF): Voice Feature Data for Age Prediction

WAVLM_Age(VF) is a dataset hosted on Kaggle. The title suggests it contains voice features, likely extracted with the WavLM model, for age prediction or analysis. No further metadata, such as sample count, file formats, or author details, is provided.

Tags: Tabular, Audio, Audio Classification, Voice Features, Speech Processing, Age Prediction (+1 more)

RadioML Optimized Dataset with Multimodal Features

RadioML Optimized Multimodal Dataset is a processed version of the RadioML dataset, stored in Zarr format. The dataset appears to be optimized for machine learning workflows and includes multimodal features. The original author, organization, and specific data volume are not provided in the available metadata.

Tags: Multimodal, Machine Learning, Radio Signal, Signal Processing, Wireless Communication (+1 more)

VinDr-CXR-VQA: Chest X-Ray Visual Question Answering with Spatial Grounding

VinDr-CXR-VQA is a large-scale dataset combining 4,394 chest X-ray images with 17,597 natural language question-answer pairs. The dataset, created by faizan711 and last updated in January 2026, is designed for explainable medical AI and includes spatial grounding annotations and clinical reasoning explanations. It features six distinct question types to facilitate research in medical visual question answering.

Tags: Multimodal, Medical Imaging, Chest X-Ray, Healthcare, Large Scale, Natural Language Processing, Visual Question Answering, Explainable AI (+1 more)

VIFoodVQA: Visual Question Answering Dataset for Food Images

vifoodvqa is a dataset published on Kaggle. The title suggests it is a Visual Question Answering (VQA) dataset focused on food images. The dataset's specific content, size, and origin require verification after download due to minimal provided metadata.

Tags: Multimodal, Computer Vision, Natural Language Processing, Food, Visual Question Answering (+1 more)

Multimodal Cardiovascular Risk Dataset with ECG Images and Clinical Features

A multimodal dataset for cardiovascular risk prediction, sourced from Kaggle. It combines ECG images with tabular clinical and biomarker data. The author, organization, size, and temporal coverage are unspecified.

Tags: Multimodal, ECG Images, Biomarkers, Health Prediction, Healthcare, Cardiovascular Risk, Clinical Features (+1 more)

RelNorm Results: Testing Social Norm Understanding in Multimodal Models

RelNorm Results is a dataset from Kaggle focused on evaluating the understanding of social norms in multimodal AI models. The dataset likely contains test results and performance metrics from experiments assessing how models interpret social contexts across different modalities. The author, organization, and specific data scale are not provided in the available metadata.

Tags: Multimodal, Model Evaluation, Multimodal AI, Social Norms (+1 more)

VQAdataset2: Visual Question Answering Dataset

VQAdataset2 is a dataset for visual question answering tasks, published on Kaggle. The dataset likely contains paired images and text questions with corresponding answers. Specific details on size, columns, and creation are not provided in the metadata.

Tags: Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering (+1 more)

MultiModal Heart Disease Dataset

MultiModal Heart Disease Dataset is a dataset published on Kaggle. Its title suggests it likely contains data related to cardiovascular health, potentially combining different data types. Metadata is minimal; actual content requires verification after download.

Tags: Multimodal, Heart Disease, Healthcare, Medical (+1 more)

Hindi-VLM-Files-v2: Hindi Vision-Language Model Training Data

A dataset likely containing files for training or evaluating Vision-Language Models (VLMs) for the Hindi language. It is published on the Kaggle platform. The specific content, scale, and creation details are not provided in the available metadata.

Tags: Multimodal, Hindi Language, Vision Language Model, Multimodal Data (+1 more)

hdrcde: Highest Density Regions and Conditional Density Estimation

hdrcde is a dataset for computational statistics, focusing on highest density regions and conditional density estimation. It was authored by Rob J. Hyndman and is hosted on Papers with Code. The dataset's specific size, temporal coverage, and geographic scope are not detailed in the provided metadata.

Tags: Tabular, Estimation, Computer Science, Mathematics, Multimodal Regression, Economics, Kernel Methods, Geography, Physics, Statistics, Density Estimation, Statistical Physics (+1 more)

BMP-VLM-2: A Vision-Language Model Dataset

Kaggle hosts the BMP-VLM-2 dataset. The title suggests it contains data for training or evaluating vision-language models, which combine image and text understanding. Specific details regarding its size, creation date, and authorship are not provided in the available metadata.

Tags: Multimodal, Vision Language Model, Multimodal AI, Computer Vision, Natural Language Processing (+1 more)

Large-Scale Multimodal Chemical Structure Images

MolParse v1.0 is a multimodal dataset released in January 2026 for optical chemical structure parsing. It contains a large-scale collection of molecular structure images sourced from scientific literature, designed to train models that convert diagrams into structured chemical representations.

Tags: Multimodal, optimized-parquet, Parquet, Size Categories: 1K<n<10K, Library: polars, Language: en, Task Categories: visual-question-answering, Modality: text, Library: mlcroissant, arXiv:2601.19325, Modality: image, Library: datasets, Library: pandas, Region: US, License: MIT (+1 more)

LLaVA Dataset 00012: A Multimodal AI Collection

A dataset from the LLaVA (Large Language-and-Vision Assistant) project, likely containing multimodal data for training or evaluating vision-language models. The dataset is hosted on Kaggle, but its specific contents, size, and creation details are not provided in the metadata. Further details about the data's origin, collection method, and temporal coverage are unknown.

Tags: Multimodal, Vision Language, LLaVA, Multimodal AI (+1 more)

Keysay VLM Context Training Image-Text Pairs

Keysay VLM Context Training is a multimodal dataset for vision-language model development, curated by Enriqueag26. It contains image-text pairs, as indicated by its platform tags for image and text modalities, and was last updated in March 2026.

Tags: Multimodal, optimized-parquet, Parquet, Image Text Pairs, Library: polars, Vision Language Model, Context Training, Size Categories: n<1K, Modality: text, Library: mlcroissant, Multimodal AI, Modality: image, Library: datasets, Library: pandas, Region: US (+1 more)

BLIP: Image Captioning Data

A dataset likely containing images paired with textual captions, inferred from the title 'blip_captions_data'. It is hosted on Kaggle, but detailed metadata such as size, source, and creation date is unavailable. The content and structure require verification after download.

Tags: Multimodal, Multimodal AI, Computer Vision, Image Captioning (+1 more)

SynVQA-UITAIC: Synthetic Visual Question Answering Benchmark

SynVQA-UITAIC is a dataset hosted on Kaggle. The title suggests it is likely a benchmark dataset for evaluating Visual Question Answering (VQA) systems, possibly containing synthetic or generated visual and textual content. Its specific contents, size, and authorship are unknown from the provided metadata.

Tags: Multimodal, AI Evaluation, Benchmark, Visual Question Answering (+1 more)

VideoMind Dataset

VideoMind-SFT, released by yeliudev in early 2026, contains 481,000 video-annotation pairs and a 210,000-record Grounder subset. The collection provides videos both in their original formats and in compressed versions at 3 FPS and 480p resolution, without audio, for efficient model training.

Tags: arXiv:2503.13444, Region: US (+1 more)
