DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Multilingual Cultural Q&A Benchmark Across 16 African Languages

Afri-MCQA is a multimodal cultural question-answering benchmark. It contains 8,000 Q&A pairs across 16 African languages from 13 countries, created by native speakers. The dataset was published by Atnafu and last updated in January 2026.

Audio, Multimodal, Multilingual, Multilingual QA, Benchmark, Cultural AI, African Languages, Speech Text Parallel, +1


VQA Zewail: Visual Question Answering Dataset

Kaggle hosts the VQA Zewail dataset, likely focused on visual question answering tasks. The dataset's specific content, size, and origin are not detailed in the provided metadata. Its creation date and last update are unknown.

Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering, +1


ROMA Proactive: Multimodal Streaming Video Interaction Data

A subset of the dataset introduced in the paper 'ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding'. This dataset is designed to train multimodal models for streaming video understanding, focusing on proactive interaction tasks. It was authored by EurekaTian and last updated on the Hugging Face platform in January 2026.

Video, Multimodal, Proactive Interaction, Multimodal AI, Streaming Data, Video Understanding, +1


OmniSpatial: A Visual-Spatial Reasoning Benchmark for Vision Language Models

OmniSpatial is a benchmark dataset for evaluating spatial reasoning in vision-language models, as presented in an ICLR 2026 paper. The data is structured in a JSON schema with components like 'id' for question identification. The dataset was created by author 'qizekun' and last updated on January 27, 2026.

Multimodal, size_categories:1K<n<10K, task_categories:image-text-to-text, Spatial Reasoning, language:en, Vision Language Models, Vision Language, modality:image, arxiv:2506.03135, Benchmark, Computer Vision, region:us, Reasoning, license:apache-2.0, ICLR, Multimodal Benchmark, +1


Maithili Instruction Tuning Dataset for Language Models

A dataset for instruction tuning, likely containing text prompts and responses in the Maithili language. It was published on the Hugging Face platform by the author Bansal123 and was last updated on March 1, 2026. The specific content, size, and collection methodology are not detailed in the available metadata.

Text, Maithili, LLM Training, Natural Language Processing, +1


MMAU: Multi-Modal Audio Evaluation Benchmark with 12 Task Types

MMAU provides between 1,000 and 10,000 test records for evaluating audio large language models, released by TwinkStart in early 2026. It is integrated into the UltraEval-Audio framework to benchmark performance across 12 task types and 10 languages. The data spans four specialized domains: speech, general sound, medical audio, and music.

OPTIMIZED-PARQUET, Parquet, size_categories:1K<n<10K, library:polars, library:dask, modality:audio, modality:text, library:mlcroissant, library:datasets, region:us, +1


WAVLM Base Local: A Self-Supervised Speech Representation Model

WAVLM Base Local is a self-supervised speech representation model. It is hosted on the Kaggle platform, but the dataset's specific contents, size, and creation details are not provided in the available metadata. The model's architecture and training methodology are likely detailed in its associated research publication.

Audio, Foundation Models, Self Supervised Learning, Speech Processing, Audio Representation, +1


Tagavlm Dataset for Multimodal Training

Tagavlm Dataset is a multimodal dataset hosted on Hugging Face, created by user tiredtony. It is intended for vision-language model training and was last updated in March 2026. Its specific contents and size are not detailed.

Multimodal, Vision Language Model, Region US, Multimodal Training, region:us, +1


VLM-Ready: 1,000 Historical Recipes with JSON Metadata

1,000 historical recipes prepared for Vision-Language Model training. The dataset includes JSON metadata, suggesting structured information about the recipes. It is hosted on Kaggle, but the original source and collection methodology are not detailed in the provided metadata.

Multimodal, Cooking, Vision Language Models, Historical Recipes, Cultural Heritage, +1


Longhorn Dam Pedestrian and Bicycle Bridge Development in Austin

City of Austin data details the development of a new pedestrian and bicycle bridge over Lady Bird Lake near Longhorn Dam. The dataset is tagged for urban planning, geospatial analysis, and multimodal infrastructure within Austin, United States. It was last updated in March 2026.

Geospatial, Multimodal, United States, Pedestrian Bridge, Urban Planning, Austin, Bicycle Infrastructure, +1


Multimodal Driver Action Dataset for Distraction Analysis

3MDAD is a multimodal, multiview, and multispectral dataset focused on driver actions and distraction. It contains video and image data from multiple camera perspectives and spectral bands for analyzing driver behavior. The dataset was created for research in automotive safety and computer vision.

Image, Multimodal, Driver Behavior, Action Recognition, Computer Vision, Automobiles And Vehicles, Automotive Safety, Deep Learning, +1


Urban Friction Atlas for Place Suitability Prediction

Urban Friction Atlas is a multimodal dataset designed for place suitability prediction tasks. The dataset integrates multiple data types, as indicated by its platform tags, to model urban environments. The author, organization, and specific temporal coverage are not provided.

Tabular, Multimodal, Cities And Urban Areas, Cities, Place Suitability, Time Series Analysis, Multimodal Data, Urban Planning, Geospatial Analysis, Deep Learning, +1


Turkish Image Captioning Dataset Translated from BLIP3o Pretraining Data

A subset of the BLIP3o-Pretrain-Long-Caption and BLIP3o-Pretrain-Short-Caption datasets translated into Turkish. The dataset is intended for training or fine-tuning image-to-text models. It was created by the author 'ituperceptron' and was last updated on January 15, 2026.

Multimodal, Multilingual, Machine Translation, Computer Vision, Image Captioning, +1


Deepchestvqa: Chest X-Ray Visual Question Answering Dataset

Deepchestvqa is a dataset hosted on Hugging Face by author ZiyueWang. The dataset's columns and sample data are unavailable, making its exact content and scale uncertain. It was last updated on March 7, 2026.

Multimodal, Medical Imaging, Chest X-ray, Vision Language, VQA, Radiology, +1


Task1 Salience Conflict Image VQA: Visual Question Answering Dataset

A dataset likely designed for Visual Question Answering (VQA) tasks, focusing on salience and conflict within images. It is hosted on Kaggle, but specific details about its size, creation date, and authorship are unknown. The dataset's content and scope require verification after download.

Multimodal, Salience, Computer Vision, Conflict, Visual Question Answering, +1


MHAL Dataset Annotations for LLaVA

MHAL Dataset Annotations for LLaVA is a dataset published on Kaggle. The title suggests it contains annotations for the LLaVA (Large Language and Vision Assistant) model, likely involving multimodal data linking images and text. The dataset's specific content, size, and authorship are unknown.

Multimodal, Vision Language, LLaVA, Multimodal Annotation, Image Captioning, +1


SBSFigures: Pre-training Figure QA from Synthesized Images

A dataset for figure question-answering, synthesized for pre-training models. It was created by researchers including Risa Shionoda and Kuniaki Saito for the AAAI-25 Workshop on Document Understanding and Intelligence. The dataset page was last updated on 2026-01-18.

Multimodal, Document Understanding, Figure QA, Multimodal QA, Large Scale, Synthetic Images, +1


Pano VQA: Panoramic Visual Question Answering Dataset

Pano VQA is a dataset hosted on Hugging Face by the user 'wakinghours', last updated on March 2, 2026. Its title suggests it is designed for Visual Question Answering tasks involving panoramic or wide-field-of-view imagery. The dataset's specific content, scale, and structure require verification after download as metadata is minimal.

Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering, +1


data_vlm_diff_ready_40_cmd: Vision-Language Model Diffusion-Ready Data

Kaggle dataset titled 'data_vlm_diff_ready_40_cmd'. The name suggests a collection of data prepared for vision-language models and diffusion processes. The dataset's specific content, size, and origin are not detailed in the provided metadata.

Multimodal, Vision Language Model, Command Generation, Multimodal Diffusion, +1


Pisc Tr: Multimodal Question Answering with Chain-of-Thought Data

A multimodal dataset from the LLaVA-CoT project, likely containing image-question-answer pairs structured for visual reasoning tasks. The dataset includes a train.jsonl file with conversation data linking images to questions and answers, suggesting a format for training or evaluating vision-language models. It was authored by 'berhaan' and last updated on 2026-01-17.

Multimodal, Multimodal QA, Image Text, Computer Vision, CoT, Visual Reasoning, +1
