DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

i-CIR: Instance-Level Composed Image Retrieval with 100K+ Hard Negatives

i-CIR is a benchmark for instance-level composed image retrieval containing between 100,000 and 1,000,000 records, released by billpsomas in 2024. It facilitates the retrieval of specific, visually indistinguishable objects by combining a reference image with a text-based modification query. The dataset includes a specialized database of visual, textual, and compositional hard negatives to test model precision.

Tags: WebDataset, Language: en, Library: webdataset, License: CC BY-NC-SA 4.0, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Modality: image, Library: datasets, Region: us, Task Categories: image-text-to-image, arXiv: 2510.25387 (+1 more)

H&M Fashion Product Metadata with Precomputed Embeddings and Image URLs

A processed and enhanced version of the H&M Personalized Fashion Recommendations Kaggle competition dataset. Qdrant cleaned the dataset and augmented it with precomputed embeddings and accessible image URLs; it was last updated in December 2025.

Tags: Multimodal, Parquet, Task Categories: image-feature-extraction, Library: polars, Task Categories: image-to-text, Recommendation, E-Commerce, Fashion, Modality: text, Size Categories: 100K<n<1M, Modality: tabular, Library: mlcroissant, Library: datasets, Library: pandas, License: CC BY 4.0, Task Categories: image-classification, Computer Vision, Region: us, Task Categories: text-classification, Retail, Embeddings (+1 more)

Surveillance VLM Weapon and Knife Detection Dataset for Instruction Tuning

Simuletic's Surveillance VLM Weapon Knife Detection Dataset is an open-source subset of the Simuletic Safety VLM Dataset. It is designed for instruction tuning of Vision Language Models to locate weapons and knives, reason about threats, and avoid false positives. The dataset was last updated on December 17, 2025.

Tags: Multimodal, ImageFolder, Vision Language Model, Size Categories: n<1K, Surveillance, Library: mlcroissant, Modality: image, License: CC BY-SA 4.0, Library: datasets, Computer Vision, Region: us, Weapon Detection (+1 more)

TaiwanVQA: Visual Question Answering Benchmark for Cultural Understanding

TaiwanVQA is a visual question answering benchmark containing 2,736 original images paired with 5,472 manually designed questions. It is designed to evaluate how well vision-language models recognize and reason about culturally specific content related to Taiwan. The dataset was created by hhhuang and last updated on December 4, 2025.

Tags: Multimodal, Taiwan, Vision Language Models, Benchmark, Computer Vision, Visual Question Answering (+1 more)

AgentVQA: Multi-Domain Visual Question Answering for Agents

AgentVQA is a multi-domain dataset for training and evaluating visual agents. The dataset contains multiple-choice questions based on screenshots, images, and videos across five domains, including GUI interaction and robot manipulation. It was created by AgentVQA and last updated on December 18, 2025.

Tags: Multimodal, Video Perception, Spatial Reasoning, Multimodal AI, GUI Interaction, Visual Question Answering (+1 more)

Safety Multimodal Jailbreaking Dataset for AI Model Testing

Safety Multimodal Jailbreaking is a dataset hosted on HuggingFace by leeeliu. The dataset was last updated on January 28, 2026, which suggests ongoing maintenance. Its title indicates it likely contains examples for testing or bypassing safety measures in multimodal AI systems.

Tags: Multimodal, Jailbreaking, AI Safety, Multimodal AI, Adversarial Examples (+1 more)

Flux 2 Pro T2I Human Preference: Over 400,000 Annotations from 50,000 Annotators

Over 400,000 human preference responses for evaluating the Flux 2 Pro text-to-image model, collected in less than seven hours via the Rapidata Python API. The dataset was created by Rapidata and last updated on December 2, 2025. It includes evaluations across preference, coherence, and alignment categories.

Tags: Tabular, Human Preference, Model Evaluation, Benchmark, Text To Image, Large Scale (+1 more)

LAION Subset: 24,840 Image-Caption Pairs for LCM-LoRA Training

Mercity's LAION Subset is a curated collection of 24,840 high-quality image-caption pairs from the LAION-5B dataset, formatted at 512x512 resolution. It is designed for training Latent Consistency Model (LCM) LoRA adapters on Stable Diffusion v1.5. The dataset, last updated in November 2025, occupies approximately 4.16 GB and is stored in Parquet files.

Tags: Multimodal, Generative AI, Image Caption Pairs, LoRA Training, Stable Diffusion, Computer Vision (+1 more)

Multimodal Physiological Stress Data from College Students

Multimodal Physiological Stress Dataset is a collection of dynamic stress data from college students, published on Kaggle. The dataset likely contains time-series physiological measurements, though specific columns and sample sizes are not detailed in the provided metadata. Its raw description indicates a focus on student stress levels, but the exact collection methodology and temporal coverage are unknown.

Tags: Time Series, Multimodal, Student Health, Multimodal Data, Physiological Stress (+1 more)

Multicultural Visual Question Answering Benchmark for VLMs

"Vision-Language Models are Confused Tourists" evaluates the cultural robustness of VLMs, a largely untested dimension crucial for supporting diverse societies. The dataset was created by patrickamadeus and last updated in December 2025. It contains image-text pairs designed to test model stability across diverse cultural inputs.

Tags: Multimodal, Parquet, Size Categories: 1K<n<10K, Library: polars, Language: en, Task Categories: visual-question-answering, Vision Language Models, Modality: text, Library: mlcroissant, Factual, arXiv: 2511.17004, Modality: image, Library: datasets, Library: pandas, Computer Vision, Multicultural Evaluation, Region: us, Robustness, VQA, License: Apache 2.0, Model Robustness, Visual Question Answering, Multicultural (+1 more)

Image Caption Dataset for Vision-Language Models

Image caption data likely contains pairs of images and descriptive text. The dataset is hosted on Kaggle, a platform for data science competitions and projects. Specific details on volume, creation method, and update recency are not provided in the metadata.

Tags: Multimodal, Computer Vision, Image Captioning (+1 more)

Chart VQA: Visual Question Answering on Charts and Graphs

Chart VQA likely contains images of charts and graphs paired with natural language questions and answers. The dataset is hosted on Kaggle, a platform for data science competitions and projects. Specific details on volume, creation date, and authorship are not provided in the available metadata.

Tags: Multimodal, Chart Analysis, Multimodal AI, Visual Question Answering (+1 more)

College Student Career Preferences with Psychological and IoT Behavioral Indicators

A dataset from Kaggle focusing on college students' career preferences. The raw description suggests it includes psychological and IoT behavioral indicators. The specific scale, collection method, and temporal coverage are not detailed in the provided metadata.

Tags: Tabular, Career Preference, Student Psychology, Education Data, Behavioral Indicators (+1 more)

Lidar Multimodal Data Collection

Lidar_multimodal likely contains data from Light Detection and Ranging (LiDAR) sensors combined with other modalities. The dataset is hosted on Kaggle, but its specific content, size, and creation details are not provided. Columns and sample data are unknown.

Tags: Geospatial, Point Cloud, Multimodal (+1 more)

Multimodal Skin Lesion Data

Kaggle hosts a dataset titled 'Multimodal_skin_lesion'. The dataset likely contains data related to skin lesions, possibly including images and other data types. The author, organization, and specific details are unknown.

Tags: Multimodal, Medical Imaging, Multimodal Data, Skin Lesion (+1 more)

Multimodal Skin Lesion Project Code

Multimodal-skin-lesion-project-code is a dataset published on Kaggle. The dataset likely contains code and data related to skin lesion analysis. Its specific content, size, and authorship require verification after download.

Tags: Multimodal, Medical Imaging, Multimodal Data, Computer Vision, Skin Lesion (+1 more)

Viet-Chart-VQA: Vietnamese Visual Question Answering Images

Viet-Chart-VQA-images is a dataset hosted on Kaggle. The title suggests it contains images paired with questions and answers, likely for training or evaluating Visual Question Answering models. The dataset's content, scale, and provenance require verification after download.

Tags: Image, Multimodal, Vietnamese, Visual Question Answering (+1 more)

Instruction Preference Dataset for Large Language Model Alignment

A dataset for aligning large language models with human preferences. The dataset is hosted on Kaggle, but its specific size, authorship, and creation date are not provided in the metadata. The content likely contains pairs of instructions and responses with preference rankings.

Tags: Text, AI Safety, Preference Data, Instruction Tuning, LLM Alignment (+1 more)

PrivLM Baseline Results: Pre-computed Exposures for GPT-Neo-1.3B

Pre-computed baseline exposures for the GPT-Neo-1.3B language model. The dataset is hosted on Kaggle and appears to contain metrics related to privacy or model behavior. The specific data format, size, and creation details are not provided in the metadata.

Tags: Tabular, GPT-Neo, Baseline Results, Privacy, Benchmark, Language Model (+1 more)

AUTOPILOT VQA Heatmaps: 661 Visual Question Answering Files

AUTOPILOT VQA Heatmaps likely contains 661 files related to Visual Question Answering, a task combining computer vision and natural language processing. The dataset appears to focus on heatmap visualizations, which are often used to interpret model attention. It is published on Kaggle, but the author, creation date, and specific content details are not provided in the metadata.

Tags: Multimodal, Heatmaps, Computer Vision, Autopilot, VQA (+1 more)
