DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,947 datasets

LLaVA-LoRA-Noisy-Baseline-Final: Vision-Language Instruction Tuning Data

Kaggle hosts the LLaVA-LoRA-Noisy-Baseline-Final dataset. The title suggests it relates to instruction tuning for vision-language models, specifically the LLaVA (Large Language-and-Vision Assistant) architecture with LoRA (Low-Rank Adaptation). It may contain a baseline dataset with noisy annotations intended for model training or evaluation.

Tags: Multimodal, Vision Language, LLaVA, Multimodal AI, Benchmark

LLM Preference Alignment: Stakeholder Data for Chinese Urban Planning

Shimin Qi produced this dataset in 2026 to support Reinforcement Learning from Human Feedback (RLHF) for large language models in urban planning. It comprises raw redevelopment data from Chinese municipal websites and multi-stakeholder annotated preference pairs used to fine-tune ChatGLM3-6B.

Tags: Social Sciences, Computer and Information Science

M3-MedQA: Multilingual Medical VQA Benchmark in Five Languages

M3-MedQA, developed by pnu-clink in 2024, contains between 1,000 and 10,000 medical image-question pairs across five languages. It extends the WorldMedQA-V dataset to evaluate cross-lingual consistency and medical reasoning in English, Korean, Japanese, Arabic, and Wolof.

Tags: OPTIMIZED-PARQUET, Parquet, Size Categories: 1K<n<10K, Library: polars, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Library: pandas, Region: us, arXiv: 2410.12722, License: mit

Spatial457: 6D Spatial Reasoning Benchmark with 10K-100K Records

Spatial457 contains between 10,000 and 100,000 image-text pairs designed for 6D spatial reasoning diagnostics. Created by researchers at Johns Hopkins University and DEVCOM Army Research Laboratory in 2025, it benchmarks the ability of multimodal models to interpret 3D orientations. The data is released under an Apache 2.0 license.

Tags: Multimodal, Size Categories: 10K<n<100K, Task Categories: image-text-to-text, Spatial Reasoning, Language: en, arXiv: 2502.08636, Region: us, License: apache-2.0

Heal Medvqa: Medical Visual Question Answering Dataset

Heal Medvqa is a dataset for medical visual question answering, likely containing image-text pairs. It was published on Hugging Face by the author tuandung2812 and was last updated on 2026-04-13. Its specific content, scale, and collection methodology require verification after download.

Tags: Multimodal, Multimodal QA, Vision Language, Medical VQA, Healthcare AI

Gemma3 VLM Finetune Dataset

Gemma3_vlm_finetune_dataset is a dataset published on Kaggle, likely intended for fine-tuning vision-language models. Its specific content, size, and structure are not described in the available metadata. The dataset's author, organization, and license details are unknown.

Tags: Multimodal, Vision Language Model, Multimodal AI, Gemma, Fine Tuning

Human Behavior Atlas: A Unified Multimodal Benchmark for Behavioral Understanding

Human Behavior Atlas aggregates and standardizes multiple behavioral datasets into a single training and evaluation framework. The dataset, created by HumanBehaviorAtlas, was last updated on Hugging Face in February 2026. It is designed to enable consistent training and evaluation of foundation models on psychological and social behavior tasks.

Tags: Multimodal, Psychology, Benchmark, Healthcare, Social Signals, Human Behavior, Multimodal Benchmark

Ghost Hunter RLHF: 8-Bit FPS Gameplay Screenshots for Preference Modeling

Ghosthunter RLHF contains under 1,000 gameplay screenshots from the 8-bit first-person shooter "Ghost Hunter," developed by webxos and updated in March 2026. The collection captures specific instances of successful ghost elimination using precision auto-fire to facilitate reinforcement learning from human feedback (RLHF).

Tags: Multimodal, IMAGEFOLDER, Task Categories: reinforcement-learning, RLHF, Task Categories: visual-question-answering, Size Categories: n<1K, Library: mlcroissant, Vision Language, Modality: image, Game, Library: datasets, Task Categories: image-classification, Region: us, Reinforcement Learning, Gym, Gameplay, License: mit, Preference Modeling

Predictions LLaVA: Vision-Language Model Outputs

Kaggle hosts this dataset titled 'predictions_llava'. The dataset likely contains outputs or predictions from the LLaVA (Large Language-and-Vision Assistant) model. Its specific scale, origin, and creation date are not detailed in the available metadata.

Tags: Multimodal, Vision Language, LLaVA, Predictions, Multimodal Predictions

Pred_LLaVA_LLaVA: Multimodal Model Predictions

Pred_LLaVA_LLaVA likely contains predictions from a vision-language model evaluation. Published on Kaggle, the dataset's specific content and scale require verification after download. Its platform tags suggest it is part of a multimodal benchmark.

Tags: Multimodal, Vision Language, Model Prediction, LLM Evaluation, Multimodal Benchmark

GAIA: Global Multimodal Remote Sensing Image-Text Pairs

GAIA is a large-scale vision-language dataset containing 205,150 image-text pairs designed to bridge the gap between remote sensing imagery and natural language understanding. The dataset is global, multimodal, and multiscale, as described in the associated research paper. It was uploaded to Hugging Face by author azavras and last updated on February 11, 2026.

Tags: Geospatial, Multimodal, JSON, Size Categories: 10K<n<100K, Library: polars, Task Categories: image-to-text, Language: en, Modality: text, Modality: tabular, Library: mlcroissant, Vision Language, Satellite Imagery, Modality: image, Library: datasets, Library: pandas, Computer Vision, Earth Observation, Region: us, Large Scale, Natural Language Processing, License: mit

Ten Visual Question Answering Failure Cases for Qwen3.5-Base Model

This dataset documents 10 specific failure cases where the Qwen3.5-Base-0.8B vision-language model produced incorrect answers on visual question answering tasks. The examples were sampled from the SimpleVQA benchmark and include the original image, question, expected answer, and the model's actual output.

Tags: OPTIMIZED-PARQUET, Parquet, Blind Spots, Library: polars, Language: en, Task Categories: visual-question-answering, Qwen, Size Categories: n<1K, Modality: text, Library: mlcroissant, Evaluation, Modality: image, Library: datasets, Library: pandas, Region: us, VQA, License: mit, Failure Analysis

Open Problems in Multimodal Sparse Data

This dataset likely focuses on challenges in multimodal machine learning with sparse data representations. It is hosted on Kaggle, but its specific size, creator, and update history are unknown. The content likely involves multiple data types combined with sparse feature sets.

Tags: Multimodal, Open Problems, Machine Learning, Sparse Data

Chatr1 Convqa All: Conversational Question Answering Data

Hugging Face hosts the Chatr1 Convqa All dataset, authored by slupart. The dataset was last updated on 2026-04-15. Its title suggests it contains conversational question-answering data, but specific content, size, and structure are not detailed in the provided metadata.

Tags: Text, Conversational QA, Question Answering, LLM Training, Chat Data

VLM-SubtleBench: 10,000+ Image Pairs for Subtle Comparative Reasoning

VLM-SubtleBench provides between 10,000 and 100,000 image pairs to evaluate the subtle comparative reasoning capabilities of Vision-Language Models. Developed by KRAFTON and released in early 2026, the dataset targets domains where visual differences are nuanced, such as medical imaging and industrial anomaly detection.

Tags: Size Categories: 10K<n<100K, arXiv: 2603.07888, Task Categories: image-to-text, Language: en, Task Categories: visual-question-answering, Comparative Reasoning, Modality: image, Multi Image, Benchmark, License: cc-by-nc-4.0, Region: us, Subtle Difference, VLM

Multimodal Multi-Turn Dialogue Safety Evaluations

The dataset, last updated in March 2026, is designed for safeguarding Vision-Language Models (VLMs). It focuses on adversarial robustness and safety alignment for interactive, multi-turn conversations. The dataset was created by author leost233.

Tags: JSON, Size Categories: 1K<n<10K, Task Categories: image-text-to-text, Library: polars, Language: en, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Library: pandas, License: cc-by-nc-4.0, Region: us, arXiv: 2509.25896

Spanish Government Multimodal QA: CVs and Images of 683 Officials

This dataset contains 683 official curricula vitae and profile images of Spanish Government members, sourced from the transparency portal transparencia.gob.es. It includes 323 questions in Spanish about properties of the CVs and images, such as matching attire color with ministry affiliation. Author megaelius published it on Hugging Face, with a last recorded update in March 2026.

Tags: Multimodal, Multimodal QA, Official Data, Transparency, Curriculum Vitae, Spanish Government

HFLB: Heterogeneous Federated Learning Benchmark with 100K-1M VQA Records

HFLB is a benchmark for heterogeneous federated learning containing between 100,000 and 1,000,000 records, developed by SNUMPR for the FedMosaic (ICLR 2026) study. It modifies constituent datasets such as GQA and Abstract VQA into distinct subtasks to support task-incremental learning research.

Tags: Task Categories: question-answering, Language: en, Size Categories: 100K<n<1M, arXiv: 1901.06706, Region: us, Agent, arXiv: 2308.12305

SpaRRTa: 149,145 Synthetic Image-Text Pairs for Spatial Intelligence

SpaRRTa contains 149,145 synthetic paired samples designed to evaluate spatial intelligence in visual foundation models, published by turhancan97 in 2026. The collection features images embedded in Parquet shards alongside detailed metadata describing scene variants and spatial configurations.

Tags: Parquet, Library: polars, Library: dask, Spatial Intelligence, Language: en, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Modality: image, Library: datasets, Task Categories: image-classification, Region: us, arXiv: 2601.11729, License: mit

MC-Search: Benchmark for Multimodal Agentic Search with Long Reasoning Chains

MC-Search is a benchmark dataset for evaluating and enhancing multimodal agentic search with structured long reasoning chains. The dataset focuses on open-world settings where Large Multimodal Models (LMMs) operate. It was created by YennNing and last updated on February 22, 2026.

Tags: Multimodal, Benchmark Evaluation, Agentic Search, Multimodal AI, Benchmark, Reasoning Chains
