DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,948 datasets

Test Images VQA: Visual Question Answering Benchmark Dataset

A dataset titled 'test-images-vqa' is hosted on Kaggle. The dataset likely contains images paired with questions and answers for visual question answering tasks. Metadata such as size, columns, and license are currently unknown.

Tags: Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering (+1 more)

BLIP: Vision-Language Pre-training Test Data

BLIP_test is a dataset hosted on Kaggle. Its title suggests it is likely related to the BLIP (Bootstrapping Language-Image Pre-training) model, a vision-language framework. The dataset's specific content, size, and structure are unknown from the provided metadata.

Tags: Multimodal, Vision Language, Computer Vision, Image Captioning (+1 more)

LLaVA Annotations for PASCAL VOC: Visual Question Answering Labels

llava-annotations-pascal-voc is a dataset hosted on Kaggle. The title suggests it contains annotations generated by the LLaVA (Large Language and Vision Assistant) model for the PASCAL VOC object detection and segmentation benchmark. The dataset likely provides question-answer pairs or descriptive labels for images, linking visual content with language.

Tags: Multimodal, PASCAL VOC, Computer Vision, Object Detection (+1 more)

RankVideo: Reasoning Reranking for Text-to-Video Retrieval

A collection of training and evaluation files derived from the MultiVENT 2.0 benchmark for text-to-video retrieval. The dataset provides structured query-video pairs within training_data.json designed to facilitate explicit reasoning over video content for relevance assessment.

Tags: JSON, Size Categories: 10K<n<100K, Library: polars, Modality: text, Modality: tabular, Library: mlcroissant, Task Categories: video-text-to-text, Library: datasets, Library: pandas, Region: us, arXiv: 2602.02444, License: mit (+1 more)

PhoStream: 5,572 QA Pairs for Mobile Omnimodal Streaming Benchmarking

PhoStream contains 5,572 open-ended QA pairs derived from 578 videos across 4 scenarios and 10 capabilities, released by lucky-lance in 2026. This benchmark evaluates omnimodal assistants in mobile-centric streaming environments, focusing on both on-screen and off-screen phone usage. It specifically tests a model's ability to determine both the timing and the content of responses while processing continuous audio-visual streams.

Tags: Video, Multimodal, Language: en, Task Categories: video-text-to-text, License: cc-by-4.0, Modality: video, Streaming, Region: us, Reasoning (+1 more)

ToS: Theory of Space Visual Scene Dataset

A collection of pre-rendered 3D multi-room environments, categorized for evaluating spatial reasoning in vision-language models. It provides structured visual scene data to support the Theory of Space benchmark, focusing on active exploration and the construction of spatial beliefs.

Tags: Size Categories: 10K<n<100K, Task Categories: image-to-text, Spatial Reasoning, 3d-scenes, Language: en, Task Categories: visual-question-answering, Task Categories: robotics, Vision Language, Modality: image, Benchmark, License: cc-by-4.0, Region: us, arXiv: 2602.07055 (+1 more)

MiniVLM: A Vision-Language Model Dataset

Kaggle hosts the MiniVLM dataset, which is likely related to vision-language modeling. The dataset's specific content, size, and creation details are not provided in the available metadata.

Tags: Multimodal, Vision Language Model, Multimodal AI, Computer Vision, Natural Language Processing (+1 more)

bmps_vlm: Vision-Language Model Training Data

A dataset titled 'bmps_vlm' published on Kaggle. The title suggests it is likely related to vision-language models, a subfield of multimodal AI. No further metadata is available to confirm its specific contents, size, or origin.

Tags: Multimodal, Vision Language Model, Multimodal AI, Computer Vision, Natural Language Processing (+1 more)

EngVQA-GRPO: English Visual Question Answering Dataset

EngVQA-GRPO is a dataset hosted on Kaggle. The title suggests it likely contains English-language visual question answering data. The dataset's specific content, size, and origin are unknown from the provided metadata.

Tags: Multimodal, English Language, Visual Question Answering (+1 more)

gvlma: Global Validation of Linear Models Assumptions

gvlma is an R package for the global validation of linear model assumptions, authored by Edsel A. Pena and Elizabeth H. Slate. The dataset likely contains statistical test results and diagnostic metrics for assessing model fit. It is sourced from the Papers with Code platform, which aggregates resources for the computer science and mathematics communities.

Tags: Tabular, R Package, Econometrics, Computer Science, Mathematics, Model Validation, Statistical Modeling, Linear Models (+1 more)

Real Classroom Sessions with Multimodal Teaching Features

236 real classroom sessions provide data across 5 modalities and 25 features. The dataset is designed for human-AI collaborative analysis of teaching quality. Its origin and creation date are unknown.

Tags: Multimodal, Teaching Quality, Human AI Collaboration, Education (+1 more)

BEAT2 Aligned Multimodal Features

BEAT2 Aligned Multimodal Features is a dataset hosted on Kaggle. The title suggests it contains features extracted from multiple data modalities that have been aligned. The dataset's specific content, size, and provenance are unknown.

Tags: Multimodal, Machine Learning, Multimodal Features, Aligned Data (+1 more)

IMDB Multimodal: Movie Data with Multiple Media Types

IMDB multimodal likely contains data from the Internet Movie Database, combining multiple types of media. The dataset is published on Kaggle, but its specific content, size, and creation details are unknown. Its last update date and authorship are not provided.

Tags: Multimodal, Movies, IMDB, Entertainment (+1 more)

ROCO Multimodal 4 Clusters Dataset

ROCO Multimodal 4 Clusters Dataset is a dataset hosted on Kaggle. The title suggests it contains multimodal data organized into four clusters. The dataset likely contains data from multiple modalities, such as images and text, intended for clustering tasks.

Tags: Multimodal, Machine Learning, Clustering (+1 more)

ROCO Multimodal 4 Clusters Dataset

A multimodal dataset from Kaggle, likely containing data organized into four clusters. The dataset's title suggests it may combine different data types such as images and text. Specific details regarding its size, creation date, and authorship are not provided in the available metadata.

Tags: Multimodal, Machine Learning, Computer Vision, Clustering (+1 more)

Training: Streaming LLM Training Script with Unsloth and FineWeb-2 Data

A script for streaming large language model training, authored by uv-scripts and last updated in January 2026. It demonstrates training a Qwen model on Latin using 1.47 million texts streamed directly from the FineWeb-2 dataset on Hugging Face Hub. The associated blog post details the method for training on massive datasets without local downloads.

Tags: Text, uv Script, Fine Tuning, Large Language Model, Training, Streaming, Region: us, Large Scale, Unsloth, Text Corpus (+1 more)
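
The streaming idea in this entry can be sketched roughly as follows. This is a minimal illustration, not the listed script: it assumes the Hugging Face `datasets` library, and the helper names `open_stream` and `batched_texts`, the `text` field name, and the repository and config identifiers shown in the closing comment are all assumptions made for the example. Rows are pulled lazily and grouped into small batches, so nothing is downloaded in full before training starts.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def open_stream(repo_id: str, config: str) -> Iterable[Dict[str, Any]]:
    """Open a Hub dataset lazily; rows are fetched only as they are iterated."""
    # Imported inside the function so the batching helper below can be used
    # (and tested) without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, name=config, split="train", streaming=True)


def batched_texts(
    rows: Iterable[Dict[str, Any]],
    batch_size: int,
    text_field: str = "text",  # assumed field name, adjust to the real schema
) -> Iterator[List[str]]:
    """Group a (possibly endless) row stream into lists of text strings."""
    it = iter(rows)
    while True:
        batch = [row[text_field] for row in islice(it, batch_size)]
        if not batch:
            return  # stream exhausted
        yield batch


# Hypothetical usage (identifiers are assumptions, and this needs network access):
#   stream = open_stream("HuggingFaceFW/fineweb-2", "lat_Latn")
#   for batch in batched_texts(stream, batch_size=8):
#       ...  # tokenize the batch and run one training step
```

Because `batched_texts` accepts any iterable of dictionaries, it can be tried on a small in-memory list before pointing it at a remote stream.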

LLaVA-CC3M-Pretrain-595K-ZH: Chinese Machine-Translated Multimodal Training Data

A Chinese-language dataset of 595,000 items, created by machine-translating the LLaVA-CC3M-Pretrain-595K dataset. The data was uploaded by author 'cyberlangke' to Hugging Face and last updated on 2026-02-25. The description notes the translations are unverified and may contain errors.

Tags: Multimodal, JSON, Machine Translation, Library: polars, Language: zh, Task Categories: visual-question-answering, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Multimodal LLM, Library: datasets, Library: pandas, Region: us, Chinese Language, License: apache-2.0, Visual Question Answering, Pretraining Data (+1 more)

Wine Bottle Images Linked to Detailed Wine Information

107,821 wine bottle images scraped from wine retailer websites, linked to a companion text dataset. The dataset was created by cipher982 and last updated on January 21, 2026. It is intended for computer vision and multimodal machine learning tasks.

Tags: Image, Multimodal, Product Images, Computer Vision, Retail, Wine (+1 more)

Elderly Fall Detection Dataset with Multimodal Images

Multimodal images for AI-based fall recognition, published on Kaggle. The dataset likely contains visual data intended for training models to detect falls in elderly individuals. Specific details on volume, collection method, and authorship are not provided in the available metadata.

Tags: Multimodal, Health Monitoring, Computer Vision, Elderly Care, Fall Detection (+1 more)

OmniCap-400M: 400 Million Image-Caption Pairs from the Web

A collection of 400 million diverse image-caption pairs gathered from the open web to support multimodal AI research. The dataset includes rich metadata for filtering and deduplication. It was created by Ajax102 and last updated on Hugging Face in January 2026.

Tags: Multimodal, Vision Language, Web Crawled, Image Text, Computer Vision, Large Scale (+1 more)
