DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,928 datasets

Contextual Decoupling in Color Preference: Multimodal Evidence from Spatial Evaluation

A multimodal dataset from a three-stage study examining color preference stability in spatial contexts. The data includes baseline preferences for ten Munsell hues, Preference and Comfort ratings, and eye-tracking and pupillometric data from a simulated makerspace environment. The dataset was authored by Hourong Yu, was last updated in May 2026, and is shared under a CC-BY-4.0 license on figshare.

Multimodal, Makerspace, Benchmark, Eye Tracking, Multimodal Research, Color Preference, Spatial Evaluation, Synthetic, +1 more

OpenDetection-30K-Human-Preferences: Object Detection Annotations from Public Images

OpenDetection-30K-Human-Preferences is an object detection dataset built primarily from publicly available, general human-preference images. Its object detection annotations were generated using automated computer vision pipelines. It was created by prithivMLmods and was last updated on Hugging Face on 2026-07-11.

Image, Multimodal, Image Annotations, Computer Vision, Object Detection, Human Preferences, Synthetic, +1 more

WildCity: Real-World Multimodal Street-View Data from U.S. Cities

WildCity is a real-world city-scale multimodal dataset for street-view reconstruction, simulation, and spatial intelligence. It was collected from autonomous-driving fleet logs across multiple U.S. cities and contains surround-view RGB images, LiDAR, calibration, ego and sensor poses, object annotations, semantic masks, and processed reconstruction assets. The dataset was authored by Neptune615 and last updated on 2026-06-30.

Point Cloud, Multimodal, Spatial Intelligence, City Scale, Street View, Autonomous Driving, Multimodal Sensor, +1 more

R-KNav: 10,000 Hours of Sidewalk Rover Multimodal Data from 15 US Locations

R-KNav is a multimodal dataset derived from real-world operations of the R-Kiwi sidewalk rover fleet. Robot.com collected 10,000 hours of data across 15 locations in the United States to support the development of AI-driven autonomous robotics. The dataset is intended to catalyze the evolution of robotics foundation models.

Multimodal, Autonomous Navigation, Multimodal Data, Robotics, Sidewalk Rover, +1 more

MMLSv2: Martian Landslide Detection in Remote Sensing Imagery

MMLSv2 is a multimodal dataset for Martian landslide detection in remote sensing imagery. It is the official dataset for the 1st Mars Landslide Segmentation Challenge (MARS-LS) and was accepted at the 4th Workshop on AI for Space (AI4Space) @ CVPR 2026. The dataset was created by MarsLS and was last updated on July 13, 2026.

Image, Geospatial, Multimodal, Segmentation Challenge, Computer Vision, Multimodal Imagery, Planetary Science, Martian Landslides, +1 more

VeriEvol-SFT: Multimodal STEM Reasoning Problems with Chain-of-Thought Solutions

VeriEvol-SFT is a supervised fine-tuning dataset for scaling multimodal mathematical reasoning via verifiable evolution instructions. The dataset, created by Ringo1110, contains single-image, single-turn visual STEM reasoning problems paired with long chain-of-thought solutions. Prompts are produced by route-specific evolution operators that rewrite problems, as detailed in the associated paper arXiv:2606.23543.

Multimodal, Chain Of Thought, Computer Vision, Stem Problems, Multimodal Reasoning, Verifiable Evol Instruct, Supervised Fine Tuning, +1 more

SpectralGPT: Remote Sensing Foundation Model for Spectral Data

SpectralGPT is the first purpose-built foundation model designed explicitly for spectral remote sensing data. The model considers unique characteristics of spectral data, such as spatial-spectral coupling and spectral sequentiality, within a masked autoencoder framework. The release includes trained models (SpectralGPT, SpectralGPT+), a new benchmark dataset (SegMunich) for semantic segmentation, original code, and implementation instructions.

Geospatial, Multimodal, Foundation Model, Spectral Data, Semantic Segmentation, Satellite Imagery, Benchmark, Computer Vision, +1 more

CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

CARV is a diagnostic benchmark created by researchers from Pennsylvania State University to evaluate compositional analogical reasoning in multimodal large language models. It assesses whether models can compose transformation rules from multiple image pairs via logical set operations like union and intersection. The dataset was last updated on July 13, 2026.

Multimodal, Multimodal Llm, Benchmark, Computer Vision, Diagnostic Benchmark, Analogical Reasoning, Visual Reasoning, +1 more

NuRisk: Visual Question Answering for Autonomous Driving Risk Assessment

TUM-AVS created the NuRisk dataset for agent-level risk assessment in autonomous driving. It contains visual question-answering pairs based on Bird's-Eye View images. The dataset was last updated on July 9, 2026.

Multimodal, Risk assessment, Computer Vision, Bev Image, Autonomous Driving, Visual Question Answering, +1 more

COCO-Mini-DeepCaption-10K: 10,000 Images with Detailed Synthetic Captions

COCO-Mini-DeepCaption-10K is a dense image captioning dataset built from a 10,000-image subset of the COCO dataset. It pairs these images with long-form synthetic captions generated using the Qwen3.5 multimodal model. The dataset was created by prithivMLmods and was last updated on July 6, 2026.

Multimodal, Multimodal Ai, Computer Vision, Image Captioning, Synthetic Data, Synthetic, +1 more

Compar:IA: French-Language Chatbot Conversations and Human Preferences

Compar:IA is a public chatbot arena run by the French Ministry of Culture. The dataset contains side-by-side conversations where users chat with two anonymous models and indicate their preferred answer. Each row represents one turn of a conversation, including the two model answers and the user's preference for that turn.

Tabular, Chatbot Evaluation, Conversational Ai, French Language, Human Preferences, +1 more

WildfireVLM: Satellite Imagery for Wildfire and Smoke Detection

3,771 labeled satellite images from Landsat-8 and GOES-16 sources, split into training, validation, and test subsets. The dataset was created by Aydin Ayanzadeh for early wildfire detection and smoke analysis, with images resized to 416 × 416 pixels. It was last updated on April 20, 2026.

Image, Time Series, Geospatial, ZIP, Satellite Imagery, Disaster Monitoring, Wildfire Detection, +1 more

AMALIA-VL-SFT: Vision and Language Training Mix for Instruction Tuning

AMALIA-VL-SFT is a vision+language training dataset mix compiled for the AMALIA project. The dataset is provided by author 'amalia-llm' and was last updated on 2026-06-30. It is composed of multiple source datasets, each contributing a single train split, excluding those derived from the core LLM training mix.

Multimodal, Vision Language, Multimodal Training, Computer Vision, Sft Dataset, Instruction Tuning, +1 more

Multimodal Trajectory Prediction Model for Uncontrolled Intersections

A pre-trained model from the paper "Multimodal Trajectory Prediction via Topological Invariance for Navigation at Uncontrolled Intersections," presented at CoRL 2020. The model was developed by Junha Roh at the University of Washington. The underlying dataset likely contains multimodal data for predicting navigation trajectories.

Multimodal, Autonomous Navigation, Uncontrolled Intersections, Multimodal Data, Trajectory Prediction, Topological Invariance, +1 more

PalmDex: A Multimodal Robotic Manipulation Dataset with Tactile Sensing

Rimbot's PalmDex is a multimodal robotic manipulation dataset collected via human teleoperation. It features synchronized dual-camera video, dual-hand tactile sensing, and hand pose tracking across diverse real-world environments, with action-segment level annotations. The dataset is described as a living collection, last updated on 2026-06-23.

Multimodal, Robotic manipulation, Teleoperation, Action Segmentation, Tactile Sensing, +1 more

NS-MFM-DGA: Anomaly Detection Data for Software-Defined Industrial Cyber-Physical Systems

86.7 KB of data supporting a neural-symbolic dynamic graph framework for real-time anomaly detection. The dataset, authored by Senlin Jiang, was last updated on May 30, 2026, and is shared under a CC-BY-4.0 license on figshare. It is associated with a model designed to address cross-modal attack evidence and dynamic topology changes in industrial systems.

Graph, Multimodal, ZIP, Dynamic Graph, Cyber Physical Systems, Anomaly Detection, Software Defined Networks, +1 more

Agri-CM3-Vision-Unsloth

Agri-CM3-Vision-Unsloth is an English vision-only dataset prepared for fine-tuning Vision Language Models using Unsloth. It is a reformatted subset of the original large-scale Chinese agricultural pest and disease benchmark HIT-Kwoo/Agri-CM3, created by farukalamai and last updated on 2026-06-30. The dataset extracts only the English vision splits, keeping all image-based tasks and formatting them in the ShareGPT conversation format compatible with Unsloth.

Multimodal, Vision Language Models, English Vision, Benchmark, Healthcare, Computer Vision, Agriculture, Pest Disease, Fine Tuning, Large Scale, +1 more

Spatial Grimoire and the Young Mage: 1.6M-Word Chinese Light Novel and AI Dialogue Dataset

The dataset contains the complete text, 1.6 million words, of the original Chinese light novel 'Spatial Grimoire and the Young Mage'. It also includes AI dialogue and instruction-tuning data constructed around the novel's worldview, character settings, and plot development. It was uploaded by user 'asd567557275' to Hugging Face and last updated on July 5, 2026.

Text, Ai Dialogue, Fantasy Fiction, Roleplay, Light Novel, Chinese Text, +1 more

RobotDesign1M: A Large-scale Multimodal Dataset for Robot Design

RobotDesign1M is a large-scale, multimodal dataset built from image–text data curated from scientific literature across a wide range of robotics domains. It is designed by Fsoft-AIC to support research on design-aware foundation models, including design image generation, visual question answering, and design image retrieval.

Multimodal, Robot Design, Robotics, Computer Vision, Large Scale, Scientific Literature, +1 more

MMPro-HIP: Multimodal Model for Elderly Hip Fracture Risk Prediction

1,287 clinical records of elderly patients from Beijing Jishuitan Hospital, comprising 643 patients with hip fractures and 644 controls, were analyzed to develop a progressive fusion model for risk prediction. The model, created by Songyuan Chen, achieved an accuracy of 90.94% and an AUC of 0.9423 on an independent test set. The dataset was last updated on 2026-04-28.

Multimodal, Medical Records, Clinical Prediction, Benchmark, Healthcare, Multimodal Fusion, Elderly Health, Hip Fracture Risk, +1 more
