DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,947 datasets

MultimodalLLM-Moroccan-SLT: Moroccan Sign Language Translation Data

MultimodalLLM-Moroccan-SLT is a dataset hosted on Kaggle. The title suggests it likely contains data for Moroccan Sign Language Translation, potentially involving multiple modalities such as video or images paired with text. The dataset's specific content, size, and authorship are unknown and require verification after download.

Multimodal, Sign Language Translation, Multimodal LLM, Computer Vision, Natural Language Processing, Moroccan, +1 more

Concept-Annotated Multimodal Image-Text Dataset

DataConcept-128M contains 128 million web-crawled image-text pairs annotated with fine-grained concept composition details. It is derived from DataComp-CLIP and designed to enable Concept-Aware Batch Sampling for multimodal pretraining.

Video, Parquet, Library: polars, Library: dask, Language: en, Size Categories: 100M<n<1B, Modality: text, Modality: tabular, Library: mlcroissant, Modality: image, Library: datasets, Pretraining, arXiv: 2511.20643, Task Categories: zero-shot classification, Region: us, VLM, DataComp, License: MIT, +1 more

Vqagent Pairwise Preference: Human Feedback Data for RL

A dataset titled 'Vqagent Pairwise Preference' was published on the Hugging Face platform by the user 'qgfvadfuvads'. The title suggests it contains pairwise preference comparisons, likely used for training or evaluating reinforcement learning agents. The dataset was last updated on April 12, 2026.

Tabular, Preference Learning, Pairwise Comparison, Reinforcement Learning, Human Feedback, +1 more

VisRes Bench: 10K-100K Visual Reasoning Tests for VLMs

VisRes Bench contains 10,000 to 100,000 image-text pairs designed to evaluate the visual reasoning of Vision-Language Models (VLMs) in naturalistic settings. Developed by researchers at TII (tiiuae) and updated in March 2026, it isolates visual logic by removing contextual language supervision.

Multimodal, Parquet, Size Categories: 10K<n<100K, Vision, Library: polars, arXiv: 2512.21194, Task Categories: image-to-text, Library: dask, Language: en, Task Categories: visual question answering, Modality: text, Library: mlcroissant, Evaluation, Modality: image, Library: datasets, Benchmark, Region: us, Reasoning, License: Apache 2.0, +1 more

Quran-MD: Fine-Grained Multimodal Quran Data at the Verse Level

Quran-MD is a multimodal dataset of the Qur'an integrating textual, linguistic, and audio dimensions at the verse and word levels. The dataset was created by 'yourmumisacow' and is associated with a paper accepted at the 5th Muslims in ML Workshop co-located with NeurIPS 2025. The specific ayah-level subset was last updated on February 21, 2026.

Audio, Multimodal, Quran, Religious Text, Linguistics, +1 more

LLaVA-15-7B-HF: A Multimodal Instruction-Tuning Dataset

A dataset likely associated with the LLaVA (Large Language-and-Vision Assistant) project for training multimodal AI models. It was published on Kaggle, but its specific contents, size, and creation details are not provided in the metadata. The dataset name suggests it is designed for instruction-following tasks involving both visual and textual data.

Multimodal, Vision Language, Multimodal LLM, +1 more

BED: Blind 3D Dataset for Vision-Language Models

A dataset for Vision-Language Models (VLMs) focused on the Blind 3D (B3D) task. The dataset was created by VietMedTeam, with main authors Nguyen Kim Hai Bui and An Ngo Xuan, and was last updated on April 1, 2026.

Multimodal, OPTIMIZED-PARQUET, Parquet, Size Categories: 1K<n<10K, Library: polars, Vision Language Models, Modality: text, 3D Vision, Library: mlcroissant, Modality: image, Library: datasets, Library: pandas, Blind 3D, Region: us, Multimodal Benchmark, +1 more

checkpoint_vqa: Visual Question Answering Dataset

A dataset titled 'checkpoint_vqa' hosted on Kaggle. The dataset's title suggests it is related to Visual Question Answering, a multimodal AI task combining images and text. The author's description indicates they are a person with autism requesting patience, but no further metadata about the dataset's content, size, or origin is provided.

Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering, +1 more

Multimodal Brain Imaging Benchmark for Clinical Task Analysis

OmniBrainBench is a multimodal benchmark dataset for brain imaging analysis across multi-stage clinical tasks. The dataset was created by FrankPN and is associated with a CVPR 2026 paper. Row count, column count, and data size are not provided in the available metadata.

CSV, Size Categories: 1K<n<10K, Library: polars, License: CC BY-SA 3.0, Task Categories: question answering, Language: en, Brain, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Library: pandas, Region: us, VLM, VQA, Medical, arXiv: 2511.00846, +1 more

VLM4D: Spatiotemporal Reasoning Benchmark with 1,000 Video Samples

VLM4D is a benchmark of approximately 1,000 real-world and synthetic videos designed to evaluate spatiotemporal reasoning in Vision Language Models. Developed by Shijie Zhou and researchers at UCLA in 2025, the dataset provides curated video-text pairs to test model awareness of motion and time.

Size Categories: n<1K, Library: mlcroissant, arXiv: 2508.02095, Task Categories: video-text-to-text, Library: datasets, Modality: video, Region: us, License: MIT, +1 more

Mdpbench Vlmevalkit: A Benchmark for Vision-Language Model Evaluation

Mdpbench Vlmevalkit is a dataset published on Hugging Face by Delores-Lin and last updated on April 13, 2026. The title suggests it is a benchmark for evaluating vision-language models.

Multimodal, Evaluation Kit, AI Benchmarking, Large Language Model, Multimodal Benchmark, +1 more

FineVision-vlmbench-mini: 128 Rows for VLM Inference Performance Benchmarking

FineVision-vlmbench-mini is a 128-row subset of the FineVision dataset, designed for benchmarking Vision-Language Model inference performance. The dataset was created by the author vlm-run and last updated on February 8, 2026. It is intended to provide a realistic and diverse workload for measuring metrics like tokens per second and VRAM usage.

Multimodal, AI Evaluation, Multimodal Data, Inference Performance, VLM Benchmark, +1 more
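As a rough illustration of the tokens-per-second metric mentioned for FineVision-vlmbench-mini, the sketch below times a generation callable over a list of samples and reports throughput. The `generate` callable, the `benchmark` helper, and the injectable `clock` are hypothetical names chosen for illustration; they are not part of the dataset or of any vlm-run tooling. VRAM measurement is left out because it depends on the inference framework in use.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(total_tokens: int, elapsed_seconds: float) -> float:
    """Return throughput in tokens per second, or 0.0 if no time elapsed."""
    if elapsed_seconds <= 0:
        return 0.0
    return total_tokens / elapsed_seconds


def benchmark(
    generate: Callable[[object], int],
    samples: Iterable[object],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run `generate` on every sample and return overall tokens per second.

    `generate` is a hypothetical stand-in for a model call; it is assumed
    to return the number of tokens it produced for one sample.
    """
    total_tokens = 0
    start = clock()
    for sample in samples:
        total_tokens += generate(sample)
    return tokens_per_second(total_tokens, clock() - start)
```

Passing the clock as a parameter keeps the timing logic testable without a real model: a fake clock and a fake `generate` are enough to check the arithmetic.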

Human Behavior Atlas: 100K+ Multimodal Records for Social Intelligence

Human Behavior Atlas (HBA) is a multimodal benchmark aggregating between 100,000 and 1,000,000 records for psychological and social behavior analysis, published by keentomato. It standardizes diverse behavioral datasets into a single framework for training foundation models on signals like emotion, intent, and sarcasm. The collection spans text, audio, image, and video modalities to support social intelligence tasks.

Multimodal, JSON, Task Categories: text generation, License: other, Library: polars, Language: en, Social Intelligence, Modality: text, Size Categories: 100K<n<1M, Psychology, Library: mlcroissant, Task Categories: audio classification, Benchmarking, Library: datasets, Library: pandas, Task Categories: image classification, Region: us, Task Categories: text classification, Human Behavior, Task Categories: video classification, +1 more

VLMData: Vision-Language Model Training Data

VLMData is a dataset published on Kaggle, likely containing data for training or evaluating Vision-Language Models. The dataset's specific content, size, and origin are not detailed in the available metadata. Its structure and intended use must be verified after download.

Multimodal, Vision Language, Multimodal AI, VLM, +1 more
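Because the VLMData listing leaves structure unverified, a first pass after download might look like the sketch below, which reports the column names and row count of a CSV file using only the Python standard library. The helper name `summarize_csv` is hypothetical, and the assumption that the download contains a CSV file with a header row is untested: this catalog entry does not state the file format.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row.

    Assumes UTF-8 encoding and that the first row holds column names;
    both are assumptions, since the listing gives no format details.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # empty list if the file is empty
        row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return header, row_count
```

The same check applies to the other entries on this page whose descriptions say the content requires verification after download, adjusted for whatever file format the download actually contains.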

Joy Captioning 20250408A: 100K-1M Image-Text Pairs for VLM Training

Joy Captioning 20250408A contains between 100,000 and 1,000,000 image-text pairs used for the initial training of the JoyCaption Beta One vision-language model. Created by fancyfeast and updated in early 2026, the collection focuses on detailed image descriptions and visual question-answering tasks. The data includes a mix of human-written and machine-generated text, explicitly labeled for provenance.

Parquet, Library: polars, Library: dask, Language: en, Task Categories: visual question answering, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Library: datasets, Region: us, VLM, Captioning, VQA, License: MIT, JoyCaption, +1 more

iOS Mobile UI Screens with 1,000 Annotated Examples

This dataset contains 1,000 real-world iOS mobile UI screens collected from diverse application categories on the Apple App Store. Each screen is paired with human-validated, structured JSON ground-truth annotations, enabling research in UI understanding and layout analysis. The dataset was created by atharparvezce and last updated on Hugging Face in February 2026.

Multimodal, Task Categories: object detection, Language: en, Vision Language, Task Categories: image classification, Computer Vision, License: CC BY-NC 4.0, Task Categories: other, Region: us, Layout Analysis, Mobile UI, iOS, +1 more

Youtube Videos Fvqa Short 0401: Short Video Question-Answering Dataset

Stephengzk published this dataset on Hugging Face on April 4, 2026. The title suggests it contains YouTube videos, likely short-form content, associated with a visual question-answering (FVQA) task. The dataset's specific content, scale, and structure require verification after download due to minimal provided metadata.

Multimodal, Multimodal QA, Short Form Content, +1 more

RoboInter-VQA: Visual Question Answering Dataset for Robotic Manipulation

RoboInter-VQA is a Visual Question Answering dataset for robotic manipulation, developed as part of the RoboInter project. It covers generation and understanding of Intermediate Representations for task planning and is built on annotations from RoboInter-Data, with raw robot datasets sourced from DROID and RH20T. The dataset was created by InternRobotics and was last updated on February 14, 2026.

Multimodal, Intermediate Representation, Robotics, Visual Question Answering, Manipulation, +1 more

STVQA-7K: Spatial Visual Question Answering Dataset with 7,587 Samples

STVQA-7K is a high-quality spatial visual question answering dataset comprising 7,587 samples. It was created by hunarbatra and last updated on January 29, 2026. The dataset is fully grounded in human-annotated scene graphs from Visual Genome and is designed for training and evaluating spatial reasoning capabilities in multimodal large language models.

Multimodal, Spatial Reasoning, Multimodal AI, Scene Graph, Visual Question Answering, +1 more

VLM Dynamic Model Information: Vision-Language Model Evaluation Data

VLM Dynamic Model Information is a dataset related to vision-language models and their evaluation, published on the Hugging Face platform. The dataset was created by 'open-cn-llm-leaderboard' and was last updated on April 3, 2026. The specific content, scale, and structure require verification after download as metadata is minimal.

Multimodal, Vision Language Model, Multimodal AI, LLM Evaluation, Region: us, Dynamic Model, License: Apache 2.0, +1 more
