DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,947 datasets

Medical VQA Vi: Visual Question Answering for Medical Images

Medical VQA Vi is a dataset for visual question answering in the medical domain, uploaded to HuggingFace by SpringWang08. Its last recorded update was on 2026-04-25 17:31:23. The dataset's specific content, scale, and structure are not detailed in the available metadata.

Multimodal, Vision Language, Multimodal Learning, Healthcare, Medical VQA, Healthcare AI, +1

Outfit-Level Virtual Try-On Dataset with Multi-Reference Garments

80,000 outfit pairs link multiple reference garment images to a model wearing the complete look. ArtmeScienceLab created this dataset for high-fidelity virtual try-on research, with a test set released in March 2026. Each pair includes 3 to 12 reference images, averaging 4.48 items per outfit.

Multimodal, WebDataset, Size Categories: 10K<n<100K, Library: webdataset, Task Categories: text-to-image, Fashion, Modality: text, Library: mlcroissant, arXiv:2603.14153, Modality: image, Library: datasets, Computer Vision, Virtual Try-On, Region: US, Large Scale, Task Categories: image-to-image, Art, License: Apache 2.0, Task Categories: image-text-to-image, +1

S2 Ai4Lcc Precomputed Clay: Sentinel-2 Land Cover Data for Foundation Models

A derived version of the Sentinel-2 Land Cover Dataset, precomputed and reformatted for direct use with the Clay foundation model for Earth observation. The dataset was prepared by author wtr001 and last updated on March 17, 2026. It is designed to bypass typical preprocessing steps like tiling during data loading for training or inference pipelines.

Geospatial, Size Categories: 10K<n<100K, Language: en, Zarr, Satellite Imagery, Image Segmentation, License: CC BY 4.0, Task Categories: image-segmentation, Modality: geospatial, Earth Observation, Region: US, Land Cover, Sentinel-2, +1

Vehicle Diagnostic Logs Sample With Fault Codes And Metrics

A sample subset of 500 structured vehicle diagnostic logs covers multiple subsystems including transmissions, battery systems, brakes, and engines. Each log contains parameters like fault codes, performance metrics, measurements, and maintenance recommendations. The dataset was created by CJJones and was last updated in March 2026.

Tabular, Time Series, Automotive Engineering, Vehicle Diagnostics, Maintenance Logs, Synthetic, +1

Vehicle Diagnostic Logs Sample For LLM Training

A 500-example subset of structured vehicle diagnostic logs was created by CJJones and last updated in March 2026. It contains logs for vehicle types and subsystems like transmissions, battery systems, brakes, and engines. Each entry includes parameters such as fault codes, performance metrics, measurements, temporal trends, and maintenance recommendations.

Tabular, Time Series, Automotive Engineering, Vehicle Diagnostics, Fault Detection, Synthetic, Maintenance Log, +1

Trendyol Cybersecurity Instruction Tuning Dataset: 53,202 Examples for AI Assistants

53,202 instruction-tuning examples covering over 200 specialized cybersecurity domains, built by the Trendyol Security Team. The dataset is designed for training defensive security AI assistants and includes modern challenges like cloud-native threats and AI/ML security. It was last updated on March 8, 2026.

Text, Cybersecurity, Large Language Models, Defensive Security, +1

OneMillion-Bench: Bilingual Expert-Level Language Agent Benchmark

OneMillion-Bench is a bilingual (English/Chinese) expert-level benchmark containing 400 entries across five professional domains, released by humanlaya-data-lab in March 2026. It uses weighted, rubric-based grading criteria to evaluate language agents on analytical reasoning and instruction following within specialized fields.

Task Categories: text-generation, Industry, Language: zh, Task Categories: question-answering, Language: en, Natural Science, Size Categories: n<1K, Modality: text, Law, Economics and Finance, Region: US, arXiv:2603.07980, License: Apache 2.0, +1

VQAv2: Visual Question Answering Dataset

VQAv2FullDataset is a dataset for visual question answering tasks, hosted on Kaggle. The dataset likely contains pairs of images and questions with corresponding answers. Metadata is minimal; the exact scale, content, and collection details require verification after download.

Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering, +1

RVMS-Bench: 1,000 to 10,000 Bilingual Video Search and Localization Queries

RVMS-Bench is a benchmark for real-world video search and moment localization developed by Tencent in 2026. It contains between 1,000 and 10,000 text-based query annotations and metadata designed for agent-based retrieval frameworks. This specific repository provides the search paradigm metadata but excludes raw video assets and ground-truth keyframes.

JSON, Size Categories: 1K<n<10K, Library: polars, Language: zh, Language: en, Modality: text, Library: mlcroissant, Task Categories: table-question-answering, Library: datasets, Library: pandas, Region: US, Agent, License: Apache 2.0, arXiv:2602.10159, +1

Random In-The-Wild Images for Vision Language Model Testing

A collection of random test images for evaluating vision-language models in diverse, unconstrained scenarios. The dataset was created by author 'merve' and was last updated in April 2026.

Multimodal, ImageFolder, Vision Language Models, Size Categories: n<1K, Library: mlcroissant, Modality: image, Library: datasets, Computer Vision, Modality: video, Region: US, Test Images, License: Apache 2.0, Multimodal Evaluation, +1

Multimodal Video Highlight Dataset for Summarization

TripleSumm-Mr.HiSum reconstructs the original MR.HiSum dataset by crawling source videos to provide aligned visual, audio, and text features. The dataset supports multimodal research for video highlight detection and summarization. It was created by hminjeong and updated in March 2026.

Audio, Video, Multimodal, Size Categories: 10K<n<100K, arXiv:2603.01169, Language: en, Task Categories: summarization, Highlight Detection, Feature Extraction, Multimodal Learning, License: CC BY 4.0, Modality: video, Summarization, Region: US, Video Summarization, +1

Common-O: Multi-Scene Reasoning with 10,000+ Household Objects

Common-O contains between 10,000 and 100,000 image-text pairs designed by Meta researchers in 2026 to evaluate multimodal LLM reasoning. The data is organized into two subsets featuring household objects to test the ability of models to identify common elements across 3 to 16 different scenes.

Parquet, Size Categories: 10K<n<100K, arXiv:2511.03768, Library: polars, Library: dask, Language: en, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Region: US, License: MIT, +1

DeepGen 1.0: Unified Multimodal Data for Reasoning Image Generation

DeepGen 1.0 contains fewer than 1,000 image-text pairs for multimodal generation and editing, released by deepgenteam in March 2026. The data supports five core tasks, including reasoning-based generation and text rendering, for a 5B-parameter model. It is formatted as an imagefolder and licensed under Apache 2.0.

ImageFolder, Task Categories: text-to-image, Size Categories: n<1K, Library: mlcroissant, Modality: image, Library: datasets, Region: US, Task Categories: image-to-image, License: Apache 2.0, arXiv:2602.12205, +1

Demo Multimodal Sarcasm Dataset for NLP and Computer Vision Tasks

A demonstration dataset for sarcasm detection, likely containing multimodal data such as text and images. It is hosted on Kaggle, but specific details about its size, creation date, and authors are not provided in the metadata. The dataset's content and structure require verification after download.

Multimodal, Sarcasm Detection, Multimodal Data, Sentiment Analysis, Natural Language Processing, +1

Multimodal Crosslingual Instruction Following Benchmark

MCIF is a human-annotated benchmark for evaluating instruction-following across speech, vision, and text modalities in four languages. The dataset was created by FBK-MT and was last updated in February 2026.

Audio, Multimodal, Multilingual, Benchmark, Computer Vision, Multimodal Benchmark, Speech Recognition, Crosslingual Evaluation, +1

Projeto Aurora IA: Multimodal Artificial Intelligence Data

Projeto Aurora IA is a dataset published on Kaggle. Its title and description suggest it contains multimodal artificial intelligence data, though the specific content, scale, and structure are not detailed. The dataset's author, organization, and last update date are unknown.

Multimodal, Multimodal AI, Artificial Intelligence, Synthetic, +1

HardNegativeDiverseVQA: Hard Negative Examples for Visual Question Answering

HardNegativeDiverseVQA is a dataset published on Kaggle. Its title suggests it contains hard negative examples for Visual Question Answering (VQA) tasks. The dataset's specific size, author, and update date are unknown.

Multimodal, Machine Learning, Hard Negative, Visual Question Answering, +1

MLE-Bench: Multimodal Model Perception Benchmark

The Multi-Level Existence Benchmark (MLE-Bench) is a dataset designed for fine-grained evaluation of multimodal models' perceptual abilities. It assesses 'pure' perception using 4-choice questions about object or scene existence within images. The dataset was created by JunlinHan and was last updated on March 8, 2026.

Multimodal, AI Evaluation, Perception Evaluation, Benchmark, Computer Vision, Multimodal Benchmark, +1

Multimodal-PathVQA-Method-01: Outputs from a Pathology Visual Question Answering Model

Multimodal-PathVQA-Method-01-outputs is a dataset from Kaggle. The title suggests it contains outputs from a method applied to a pathology visual question answering (VQA) task, likely involving images and text. The dataset's specific content, scale, and origin are not detailed in the provided metadata.

Multimodal, Medical Imaging, Multimodal QA, Pathology, Visual Question Answering, +1

Multimodal PathVQA Method-01: Sample Test Images

Sample test images likely associated with a multimodal visual question answering method for pathology. The dataset is hosted on Kaggle, but its scale, creator, and update history are unspecified. Columns and detailed metadata are unknown.

Multimodal, Multimodal QA, Pathology, Medical VQA, +1

Page 36 of 97