DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,944 datasets

Ubuntu OSWorld Verified Trajectories: 100K+ Multimodal Agent Paths

OSWorld-Verified Model Trajectories contains between 100,000 and 1,000,000 evaluation records of multimodal AI agents performing tasks in real computer environments. Created by xlangai and updated in March 2026, the data captures verified execution paths and screenshots from state-of-the-art models tested on the OSWorld benchmark.

Tags: Size: 100K < n < 1M, Code, Region: us, License: mit, +1 more

ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Software

ScreenSpot-Pro contains between 1,000 and 10,000 high-resolution GUI screenshots for grounding tasks, published by likaixin in 2026. It targets professional software environments on macOS, specifically providing labeled coordinates for icons and text elements in tools like Visual Studio Code, PyCharm, and Android Studio.

Tags: Size: 1K < n < 10K, Task: image-text-to-text, Language: en, Benchmark: official, Benchmark: eval-yaml, Region: us, Agent, License: mit, +1 more

Text-to-Image DPO Human Preferences: 416,360 Pairwise Judgments

DatapointAI created a dataset of 416,360 pairwise human judgments comparing AI-generated images. The data was collected from approximately 20,000 annotators, focusing on prompt alignment and overall preference. The full, unfiltered version was last updated on March 30, 2026.

Tags: Tabular, Multimodal, Pairwise Comparison, AI Evaluation, Human Preference, Benchmark, Text To Image, Computer Vision, Synthetic, +1 more

OpenJudge

A benchmark dataset for evaluating graders across text, multimodal, and agent scenarios. It supports the OpenJudge framework with labeled preference pairs for quality-assured grader development. The dataset was created by agentscope-ai and last updated on March 4 (year not listed).

Tags: Text, Multimodal, AI Evaluation, Benchmark, Grader Development, Preference Pairs, +1 more

LLM Training Antenna Design: Synthetic Multi-Band Antenna Designs

Procedurally generated antenna designs across various frequency bands and configurations, created to be technically consistent and realistic. The sample is authored by CJJones, with a full dataset of 100,000 records available externally. The dataset page was last updated on 2026-03-08.

Tags: Tabular, Procedural Generation, Antenna Design, Radio Frequency, Multi Band, Synthetic Data, Synthetic, +1 more

Text-2-Image DPO Human Preferences: 40,000 Trust-Weighted Judgments

A quality-controlled human preference dataset for text-to-image generation. It contains 40,000 trust-weighted pairwise judgments from calibrated annotators, comparing AI-generated images on prompt alignment and overall preference. This subset, created by datapointai, is described as the highest-annotator-quality version.

Tags: Multimodal, AI Evaluation, Benchmark, Text To Image, Computer Vision, Human Preferences, DPO, Synthetic, +1 more

Text-2-Image DPO Human Preferences: 80,000 Trust-Weighted Pairwise Judgments

80,000 trust-weighted pairwise judgments from calibrated annotators compare AI-generated images on prompt alignment and overall preference. The dataset was built on the Datapoint annotation platform for collecting high-quality human preference data at scale. It was authored by datapointai and last updated on March 30, 2026.

Tags: Multimodal, Generative AI, Pairwise Comparison, AI Evaluation, Benchmark, Text To Image, Computer Vision, Human Preferences, Large Scale, Synthetic, +1 more

VLM Direction Testbed: A Multimodal AI Evaluation Dataset

A dataset hosted on Hugging Face by author takhyun03, last updated on 2026-05-06. Its platform tags describe it as a testbed for evaluating Vision Language Models (VLMs) and multimodal AI. The listing does not detail its content, size, or structure.

Tags: Multimodal, Vision Language Models, Model Evaluation, Multimodal AI, Testbed, +1 more

Korean Multimodal Exam Questions for College-Level Reasoning

3,466 multimodal questions combine images with Korean text to evaluate advanced reasoning. The dataset, created by HAERAE-HUB, is sourced from Korean civil service, technical qualification, and academic olympiad exams. The listing does not specify its structure or column details.

Tags: IMAGEFOLDER, Size: 1K < n < 10K, Library: mlcroissant, Modality: image, Library: datasets, License: CC BY-NC 4.0, Region: us, +1 more

LUCID: Lunar Captioned Image Dataset for Vision-Language Training

LUCID is a large-scale multimodal dataset for vision-language training on real lunar surface observations. It was introduced as part of the paper 'LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration' (Inal et al., 2025, under review) and is hosted by the author 'pcvlab'. The dataset was last updated on the platform in March 2026.

Tags: Geospatial, Multimodal, OPTIMIZED-PARQUET, Parquet, Task: image-text-to-text, Library: polars, Task: image-to-text, Library: dask, Language: en, Task: visual-question-answering, Panchromatic, Modality: text, Size: 100K < n < 1M, Library: mlcroissant, Vision Language, Modality: image, Library: datasets, License: CC BY 4.0, Computer Vision, Modality: geospatial, Region: us, Large Scale, Planetary Science, arXiv: 2603.24696, Lunar Imagery, Lunar, +1 more

Multimodal CAPTCHA-Solving Agent Training Data

ReCAP-187K-SFT contains supervised fine-tuning data for training multimodal GUI agents to solve CAPTCHAs. The dataset is structured in Qwen3-style conversation format and includes references to screenshot images from interaction trajectories. It was created by ReCAP-Agent and last updated in March 2026.

Tags: JSON, Task: image-text-to-text, License: other, Library: polars, Library: dask, Modality: text, Size: 100K < n < 1M, Library: mlcroissant, Library: datasets, Region: us, +1 more

ConsistCompose3M: 3 Million Samples for Multimodal Layout-Controlled Image Composition

ConsistCompose3M provides approximately 3 million samples for layout-controllable multi-instance image composition. The dataset, created by sensenova, offers structured spatial-semantic supervision and includes identity-preserving samples filtered by CLIP/DINO similarity. It was last updated on March 31, 2026.

Tags: Multimodal, Computer Vision, AI Training Data, Large Scale, Multimodal Layout, +1 more

SidewalkVQA: Visual Question Answering for Street Scene Understanding

SidewalkVQA is a dataset hosted on Kaggle, likely containing images of street scenes paired with questions and answers. The dataset's specific size, creation date, and author are unknown from the provided metadata. Its content and structure require verification after download.

Tags: Multimodal, Multimodal AI, Street Scenes, Computer Vision, Visual Question Answering, +1 more

BRIGHT: Multimodal Satellite Imagery for Disaster Response Across 14 Regions

An open-access multimodal dataset curated by Kullervo for AI-based disaster response. It contains about 4,200 paired optical and SAR images covering five natural and two man-made disaster types across 14 global regions, with a focus on developing countries. The dataset includes over 380,000 building instances at spatial resolutions between 0.3 and 1 meter.

Tags: Geospatial, Multimodal, Language: en, Size: 1B < n < 10B, Satellite Imagery, Modality: image, License: CC BY-SA 4.0, Task: feature-extraction, Task: zero-shot-classification, Artificial Intelligence, Task: image-segmentation, Building Damage Mapping, Modality: geospatial, Building Damage, Earth Observation, Region: us, DOI: 10.57967/hf/6963, Disaster Response, +1 more

VQA Book: Visual Question Answering Dataset for Books

VQA Book is a dataset hosted on Hugging Face by nguyenhung310505. The dataset was last updated on 2026-05-14. The listing does not state its specific content or scale.

Tags: Multimodal, Books, Multimodal QA, Visual Question Answering, +1 more

BD-HazardVLM_500: Vision-Language Dataset for Hazard Detection

BD-HazardVLM_500 is a dataset published on Kaggle. Its title suggests a focus on hazard detection, likely containing 500 examples for vision-language model tasks. The dataset's specific content, collection method, and temporal scope are not detailed in the available metadata.

Tags: Multimodal, Vision Language Model, Multimodal AI, Hazard Detection, +1 more

Human Hallucination Verification Dataset for Multimodal Models

HHVD is a Human Hallucination Verification Dataset for multimodal hallucination verifiability. It contains 4,470 time-constrained human responses to image-text pairs, designed to evaluate obvious and elusive hallucinations. The dataset was created by BeEnough and last updated in April 2026.

Tags: Multimodal, Image Text Pairs, Benchmark, Human Evaluation, Computer Vision, Verification Benchmark, Multimodal Hallucination, +1 more

Interactionindex: Synthetic Preference Data for Coding and Safety Tasks

A synthetic preference dataset created by 8F-ai and last updated in March 2026. It is organized into four subsets focused on coding tasks, safety-sensitive refusals, honesty checks, and everyday assistant behavior. The dataset is designed for preference modeling, dataset tooling, and RLHF-style experimentation.

Tags: Text, JSON, Size: 1K < n < 10K, Safety, Library: polars, Library: dask, Modality: text, Library: mlcroissant, Library: datasets, Region: us, Coding, Synthetic Data, Human Feedback, License: mit, Synthetic, Preference Modeling, +1 more

CC3M: Conceptual Captions with 3.3M Web-Harvested Image-Text Pairs

Approximately 3.3 million images annotated with captions harvested from web image alt-text attributes. The dataset was created to provide a wider variety of caption styles compared to curated datasets. It is hosted on Hugging Face by author 'chaocq' and was last updated on March 17, 2026.

Tags: Multimodal, WEBDATASET, Image To Text, License: other, Task: image-to-text, Size: 1M < n < 10M, Library: webdataset, Modality: text, Library: mlcroissant, Modality: image, Alt Text, Library: datasets, Computer Vision, Multimodal Captions, Region: us, Web Harvested, +1 more

KOL Decisions: Web3 Community Manager Instruction Tuning Dataset

KOL Decision-Making Dataset for Web3 Community Managers is designed for instruction tuning. The dataset likely contains examples of decisions or actions taken by Key Opinion Leaders in Web3 communities. Its origin and scale are unspecified, as the description metadata is limited.

Tags: Tabular, Community Management, Web3, Decision Making, Instruction Tuning, +1 more
