DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,941 datasets

CiQi-VQA: Visual Question Answering on Chinese Porcelain Artifacts

A multimodal dataset for cultural reasoning on antique Chinese porcelains, created by SII-Monument-Valley. The dataset is part of the CiQi-Agent project, which aligns visual perception, tool-augmented reasoning, and cultural knowledge. It was last updated on April 1, 2026.

Tags: Multimodal, Size Categories: 10K<n<100K, Task Categories: image-text-to-text, Language: zh, Task Categories: question-answering, arXiv: 2603.28474, Language: en, Modality: image, Benchmark, Computer Vision, License: CC BY-NC 4.0, Region: us, Chinese Art, Art, Agent, Multimodal VQA, Cultural Heritage, Domain Agent, Visual Reasoning, +1 more

Leak-CURBER: A Leakage-Controlled Benchmark for Enzymatic Reaction Tasks

Leak-CURBER is a dataset and code package created for the NeurIPS 2026 Evaluations and Datasets track. It likely contains multimodal data for evaluating tasks related to enzymatic reactions. The dataset was uploaded by an anonymous author on May 7, 2026.

Tags: Multimodal, NeurIPS, Machine Learning, Benchmark, Multimodal Evaluation, Enzyme Reactions, +1 more

Multimodal Fusion Models for Violence Detection

A dataset likely containing multimodal data for training and evaluating models that detect violent behavior. It is hosted on Kaggle, but specific details about its size, creation date, and authorship are not provided. The content and structure must be verified after download.

Tags: Multimodal, Machine Learning, Computer Vision, Multimodal Fusion, Violence Detection, +1 more

VVSim: Large-Scale Aerial-Ground Cooperative Perception Dataset

61,000 fully annotated frames collected for aerial-ground cooperative perception. The dataset integrates synchronized multimodal sensing data and state information from vehicles and UAVs, covering 19 interaction scenarios and 5 weather conditions. It was created by LOTEAT and last updated on Hugging Face in April 2026.

Tags: Multimodal, Task Categories: object-detection, Aerial-Ground Cooperative Perception, Task IDs: semantic-segmentation, License: CC BY 4.0, Task Categories: image-segmentation, Multimodal Sensing, Task Categories: other, Region: us, Large Scale, Trajectory Prediction, Vehicle Detection, Task IDs: vehicle-detection, Autonomous Driving, Aerial-Ground Cooperation, +1 more

Multimodal Closedset Unified: A Unified Dataset for Multimodal Tasks

A multimodal dataset titled 'multimodal_closedset_unified' is hosted on Kaggle. The dataset's specific content, size, and structure are not described in the available metadata. Its author, organization, and last update date are unknown.

Tags: Multimodal, Machine Learning, Closed Set, Computer Vision, Unified, +1 more

Tulu3 Instruction Following SFT 16K Bucket

A dataset titled 'Tulu3 Instruction Following SFT 16K Bucket' is hosted on Kaggle. The title suggests it is likely a collection of instruction-response pairs for supervised fine-tuning of language models. The specific content, size, and creation details are not provided in the available metadata.

Tags: Text, Language Model, AI Training Data, Instruction Following, Supervised Fine-Tuning, +1 more

vlmclip1: Vision-Language Model Training Data

The dataset titled 'vlmclip1' is hosted on Kaggle. Its name suggests a connection to vision-language models, likely containing data for training or evaluating systems like CLIP. The specific content, size, and structure require verification after download.

Tags: Video, Multimodal, Vision Language Model, Multimodal AI, +1 more

HandVQA

HandVQA is a dataset introduced in a CVPR 2026 paper by researchers from UNIST, University of Aberdeen, University College London, and Fogsphere. It is designed for diagnosing and improving fine-grained spatial reasoning about hands in vision-language models. The dataset page is hosted on Hugging Face by author kcsayem and was last updated on March 30, 2026.

Tags: Multimodal, Spatial Reasoning, Hand Perception, Vision Language Models, Computer Vision, VQA Benchmark, +1 more

Nexus V11: Multimodal Trading Data for Bollinger Bands Strategy

A multimodal dataset designed for training Vision-Language Models to identify trading exhaustion and opportunities. It was created by author SpaceGhost using a Hindsight Mining technique to capture decision snapshots. The dataset was last updated on Hugging Face on April 10, 2026.

Tags: Multimodal, Trading, Vision Language Models, Technical Analysis, Computer Vision, Finance, +1 more

Vero Visual Reasoning Dataset for Multimodal AI Training

Vero-600k is a collection of data for training and evaluating general visual reasoning models, created by researchers at Princeton University's zlab. The dataset supports broad multimodal reasoning tasks across charts, STEM problems, spatial reasoning, and knowledge grounding. It was released in early 2026.

Tags: Multimodal, STEM Reasoning, Multimodal AI, Benchmark, Visual Reasoning, +1 more

Dataset-Multimodal: Combined Data Types for AI Models

Dataset-multimodal likely contains multiple data types such as images, text, or audio for training integrated AI systems. Published on Kaggle, its specific content and scale are not detailed in the available metadata. The author, organization, and last update date are unknown.

Tags: Multimodal, Machine Learning, Multimodal Data, +1 more

Japanese VLM Benchmark Collection with Human-Refined Annotations

JAMMEval is a curated benchmark collection for evaluating Vision-Language Models on Japanese Visual Question Answering tasks. It refines seven existing Japanese VQA evaluation datasets through two rounds of human annotation to improve reliability. The dataset was created by llm-jp and was last updated in April 2026.

Tags: Multimodal, Benchmark Evaluation, Vision Language Models, Benchmark, Computer Vision, Japanese Language, Visual Question Answering, +1 more

SFT Dataset: A Mixture for Instruction Following and Step-by-Step Reasoning

SFT-Dataset is a curated, medium-scale mixture designed to push a base model toward stronger step-by-step reasoning and reliable instruction following. The dataset was created by SeaFill2025 and was last updated on Hugging Face in April 2026. Quantities are chosen to be trainable on modest GPU budgets while keeping signal density high.

Tags: Text, Parquet, Size Categories: 10K<n<100K, Task Categories: text-generation, License: other, Library: polars, Language: en, Math Reasoning, Modality: text, Code, Chain of Thought, Library: mlcroissant, Library: datasets, Library: pandas, Code Generation, SFT, Region: us, Reasoning, Science, Math, Instruction Following, Supervised Fine-Tuning, +1 more

Large-Scale Egocentric Multimodal Dataset for Embodied AI

Xperience-10M is a large-scale egocentric multimodal dataset of human experience created by ropedia-ai. It is designed for research in embodied AI, robotics, and world models. The dataset was last updated on March 20, 2026.

Tags: Point Cloud, Multimodal, License: other, Modality: 3d, Task Categories: image-to-text, Size Categories: 1M<n<10M, Language: en, Task Categories: robotics, Egocentric, Human Motion, Multimodal Data, Robotics, First Person, Modality: video, Mocap, Region: us, Large Scale, Egocentric Vision, Task Categories: depth-estimation, 4D, Embodied AI, Task Categories: video-classification, +1 more

ChartNet RealWorldChart: 30,000 Chart Images with Descriptive Captions

A collection of 30,000 real-world chart images paired with detailed natural-language captions, intended for chart understanding and image-to-text research. The dataset was created by the 2077AIDataFoundation and was last updated on April 3, 2026.

Tags: Multimodal, Multimodal AI, Chart Images, Computer Vision, Image Captioning, +1 more

WavLM Phase 2 S1: Self-Supervised Speech Representations

WavLM Phase 2 S1 is a dataset hosted on Kaggle, likely containing audio data for self-supervised speech representation learning. The specific content, size, and structure are not detailed in the available metadata. Its origin and creation date are unknown.

Tags: Audio, Self-Supervised Learning, Speech Processing, Audio Representation, +1 more

INDOTABVQA: Cross-Lingual Table Visual Question Answering for Indonesian Documents

INDOTABVQA is a benchmark dataset for evaluating Vision-Language Models on cross-lingual table understanding in Bahasa Indonesia document images. The dataset was created by NusaBharat and is associated with a paper accepted at ACL 2026 Findings. It was last updated on the Hugging Face platform on April 9, 2026.

Tags: Multimodal, Cross-Lingual VQA, Vision Language Models, Benchmark, Bahasa Indonesia, Document Images, Computer Vision, +1 more

NGLD Grape Leaf Disease Images for Visual Language Models

The dataset is derived from the Niphad Grape Leaf Disease Dataset (NGLD), which contains high-quality images of table grape leaves categorized by disease. The original dataset was created by researchers from Symbiosis Institute of Technology and published on Mendeley Data under a CC BY 4.0 license. This version, uploaded by qingwuuu, appears to be adapted for use with visual language models.

Tags: Image, Healthcare, Computer Vision, Agriculture, Plant Disease, +1 more

Doc InfographicVQA: Visual Question Answering on Infographics

Doc InfographicVQA is a dataset hosted on Kaggle. The dataset likely contains infographic images paired with questions and answers to support multimodal AI research. Its specific size, creator, and creation date are not provided in the available metadata.

Tags: Multimodal, Document Understanding, Multimodal AI, Infographics, Visual Question Answering, +1 more

Doc MP-DocVQA: A Document Visual Question Answering Dataset

Doc MP-DocVQA is a dataset for Visual Question Answering on documents, hosted on Kaggle. The dataset likely contains images of documents paired with questions and answers to test machine comprehension. Specific details on size, creation date, and authorship are not provided in the available metadata.

Tags: Multimodal, Document Understanding, Multimodal QA, Document VQA, +1 more
