DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Robotics Video Question Answering Dataset

RoboVQA contains video and text data for training models to answer questions about robotic scenes. Its Hugging Face size category places it between 100,000 and 1 million entries. It was created by Tianli and last updated in July 2025.

Tags: Video, Multimodal, JSON, Machine Learning, Region:us, Vision Language, Modality:video, Multimodal AI, Library:datasets, Library:dask, Robotics, Modality:text, Library:mlcroissant, Size_categories:100k<n<1m, Video Question Answering, +1

Robo2VLM-Reasoning: Chain-of-Thought for Robotic VQA

Reasoning traces generated by Gemini-2.5-pro for the Robo2VLM-1 visual question answering benchmark. The dataset contains logical, step-by-step explanations that justify correct answers for robotic manipulation tasks across diverse, in-the-wild environments.

Tags: Parquet, Size_categories:1k<n<10k, Library:polars, Library:dask, Language:en, Arxiv:2505.15517, Task_categories:visual-question-answering, Modality:text, Library:mlcroissant, Vision Language, Modality:image, Library:datasets, Robotics, Region:us, License:apache-2.0, +1

BLIP3o-60k: GPT-4o Distilled Text-to-Image Instruction Dataset

BLIP3o-60k is a dataset distilled from GPT-4o for instruction tuning of text-to-image models. It includes categories such as JourneyDB, human-centric data from MSCOCO, Dalle3 outputs, Geneval, common objects, and simple text. The dataset was created by BLIP3o and last updated on May 25, 2025.

Tags: Multimodal, GPT-4o, Text To Image, Computer Vision, +1

Therapeutics Data Commons: Multimodal Benchmarks for Drug Discovery

Therapeutics Data Commons (TDC) is a collection of multimodal benchmarks and datasets for drug discovery and therapeutic science developed by the Harvard MIMS group. Updated as recently as July 2025, it provides a standardized framework for evaluating machine learning models across the drug development pipeline.

Tags: Medicine, Machine Learning, Therapeutics, Biology, Benchmarks, Biotech, Bioinformatics, Artificial Intelligence, Drug Discovery, Chemistry, Cheminformatics, Precision Medicine, Deep Learning, Biomedicine, +1

LLaVA-OneVision Multimodal Instruction Data

A 2024-09-01 upload of filtered VisualWebInstruct data for the OneVision training stage. The dataset, created by lmms-lab, contains subsets like ureader_kg and ureader_qa, provided as processed JSON files and compressed image folders.

Tags: Multimodal, Parquet, Image Text Pairs, Library:polars, Language:zh, Library:dask, Size_categories:1m<n<10m, Language:en, Modality:text, Library:mlcroissant, Vision Language, Multimodal AI, Modality:image, Library:datasets, Arxiv:2408.03326, Computer Vision, Region:us, Instruction Tuning, License:apache-2.0, +1

Chasm Covert Advertisement on RedNote

4,992 social media posts from the RedNote platform categorized into 613 advertisement and 4,379 non-advertisement samples. The dataset includes 26,324 associated images distributed across training, validation, and test splits for covert marketing detection.

Tags: Multimodal, Parquet, Size_categories:1k<n<10k, Library:polars, Language:zh, Modality:text, Covert Advertisement Detection, Modality:tabular, Library:mlcroissant, Social Media, Image Text, Library:datasets, Library:pandas, Region:us, Red Note, License:mit, Xiaohongshu, +1

VQA Multitask: Visual Question Answering Dataset

VQA Multitask is a dataset for multitask learning, likely combining visual and textual data for question answering. It was published on Hugging Face by author WaltonFuture and was last updated on July 9, 2025. The specific content, scale, and structure require verification after download.

Tags: Multimodal, Computer Vision, Natural Language Processing, Multitask Learning, Visual Question Answering, +1

Fixtures Docvqa: Document VQA Test Images for LayoutLMv2

Two document images from the DocVQA dataset serve as fixtures for the Hugging Face Transformers library. These samples facilitate the testing of LayoutLMv2FeatureExtractor and LayoutLMv2Processor across specific unit test files.

Tags: Parquet, Library:polars, Size_categories:n<1k, Modality:text, Library:mlcroissant, Modality:image, Library:datasets, Library:pandas, Region:us, +1

SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models

A benchmark dataset comprising over 14,500 questions on non-synthetic images, created to assess stereotype biases in Large Multimodal Models (LMMs). The dataset, authored by ucf-crcv, was last updated on May 16, 2025. It spans nine diverse domains and 54 sub-domains to rigorously evaluate LMM performance in visually grounded stereotypical scenarios.

Tags: Multimodal, AI Fairness, Benchmark, Computer Vision, Stereotype Bias, Multimodal Benchmark, Synthetic, +1

MSR-VTT: 10,000 Video Clips with 200,000 Captions for Text-Video Retrieval

MSR-VTT is a benchmark dataset for text-video retrieval, containing 10,000 video clips and 200,000 captions. It was introduced in the 2016 paper 'MSR-VTT: A large video description dataset for bridging video and language' and is hosted on Hugging Face by user friedrichor. The dataset uses a standard 1K-A split protocol with training sets of 7,010 and 9,000 videos and a test set of 1,000 videos.

Tags: Text, Video, Benchmark, Video Captioning, Multimodal Benchmark, Text Video Retrieval, +1

VS-Bench: Vision-Language Model Benchmark for Multi-Agent Environments

VS-Bench is a multimodal benchmark for evaluating Vision-Language Models in multi-agent environments. The benchmark evaluates fourteen state-of-the-art models across eight vision-grounded environments using two complementary dimensions. It was created by author zelaix and last updated on June 4, 2025.

Tags: Multimodal, Strategic Reasoning, Multi Agent Environments, Benchmark Evaluation, Vision Language Models, Benchmark, Computer Vision, +1

OpenDocVQA: A Unified Corpus for Document Visual Question Answering

A training and evaluation corpus for VDocRAG, a retrieval-augmented generation framework designed to understand real-world documents from visual features. The dataset is a unified collection of open-domain document visual question answering data, encompassing diverse document types and formats. It was created by NTT-hil-insight and last updated on 2025-05-26.

Tags: Multimodal, RAG Framework, Multimodal AI, Benchmark, Natural Language Processing, Visual Question Answering, Document VQA, +1

Geometry3K In Context Synthesizing

2,101 image-text pairs designed for unsupervised post-training of multi-modal large language models. Each entry includes a 'problem' field with a geometric reasoning question and an 'answer' field containing the corresponding solution.

Tags: Parquet, Size_categories:1k<n<10k, Task_categories:image-text-to-text, Library:polars, Arxiv:2505.22453, Modality:text, Library:mlcroissant, Modality:image, Library:datasets, Library:pandas, Region:us, +1

Multi-Modal Geospatial Datacubes from Five Satellite Sensors

Core-Five is a multi-modal geospatial dataset built for foundation models, unifying Earth Observation data from five essential sensors into aligned spatiotemporal datacubes. It includes optical Sentinel-2 data at 10 m resolution and other sensor data for multi-modal vision tasks.

Tags: License:cc-by-nc-3.0, Task_categories:object-detection, Task_categories:image-to-text, Language:en, Task_categories:summarization, Size_categories:10m<n<100m, Super Resolution, Foundation Models, Contrastive Learning, Geospatial Foundation Model, Task_categories:feature-extraction, Task_categories:image-classification, Task_categories:image-segmentation, Spatio Temporal Learning, Self Supervised Learning, Modality:geospatial, Region:us, Task_categories:unconditional-image-generation, Task_categories:image-to-image, Task_categories:translation, +1

MedTrinity-25M: 25 Million Multimodal Medical Records with Multigranular Annotations

MedTrinity-25M consists of 25 million multimodal medical records featuring multigranular annotations, developed by UCSC-VLAA for ICLR 2025. The dataset provides large-scale image-text pairings designed to advance the training and evaluation of medical Multimodal Large Language Models (MLLMs).

Tags: Multimodality, MLLMs, +1

LLaVA 3D Data: Multimodal AI Dataset

LLaVA 3D Data is a multimodal dataset published on Hugging Face by author ChaimZhu. The dataset was last updated on July 11, 2025. Its specific content and scale are not detailed in the available metadata.

Tags: Multimodal, JSON, 3D Data, Library:polars, Modality:text, Size_categories:100k<n<1m, Library:mlcroissant, LLaVA, Multimodal AI, Library:datasets, Library:pandas, Region:us, Text Data, +1

BLIP3o Pretrain Short Caption: 5 Million Images with Generated Captions

5 million images are each paired with a short caption generated by the Qwen/Qwen2.5-VL-7B-Instruct model. The dataset was created by BLIP3o and last updated on Hugging Face in May 2025. It is intended for pretraining vision-language models.

Tags: Multimodal, Generated Captions, Vision Language, Pretraining, Image Captioning, Large Scale, Synthetic, +1

BLIP3o Pretrain JourneyDB: 4 Million AI-Generated Images

4 million images from the JourneyDB collection, hosted by the BLIP3o organization. The dataset was last updated on May 26, 2025. It is intended for use in pretraining multimodal AI models.

Tags: Image, Multimodal, Multimodal Pretraining, Computer Vision, Image Generation, Large Scale, +1

Fineweb URLs: Source URLs and Domains for LLM Training Data

A dataset created by nhagar on May 15, 2025, providing the URLs and top-level domains associated with training records in the HuggingFaceFW/fineweb dataset. It was created by downloading source data, extracting URLs and domains, and retaining only those identifiers to make exploring LLM training datasets more accessible.

Tags: Tabular, Parquet, Task_categories:text-generation, Library:polars, Library:dask, Language:en, Text Generation, Modality:text, Web Data, Library:mlcroissant, Library:datasets, Doi:10.57967/hf/5441, URL Extraction, Size_categories:10b<n<100b, Region:us, LLM Training, License:odc-by, +1
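
The extraction step described for the Fineweb URLs entry (keep only each record's URL and domain, drop everything else) could look roughly like the sketch below. This is not the author's actual pipeline: the `url_record` helper, the `url`, `domain`, and `text` field names, and the choice to treat the URL's hostname as the "domain" are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def url_record(url: str) -> dict:
    """Reduce a record to its URL and host (hypothetical helper)."""
    # .hostname is lowercased and excludes any port or credentials;
    # it is None when the string has no network location.
    host = urlsplit(url).hostname or ""
    return {"url": url, "domain": host}


# Made-up sample records standing in for rows of a web-text dataset.
records = [
    {"url": "https://www.example.com/a/b?x=1", "text": "page body ..."},
    {"url": "http://Blog.Example.org:8080/post", "text": "another body ..."},
]

# Retain only the identifiers; the page text is discarded.
slim = [url_record(r["url"]) for r in records]
```

Using the hostname keeps the sketch within the standard library; reducing it further to a registered domain would need a public-suffix list, which is outside what this entry describes.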

WildDoc: A Benchmark for Real-World Document Understanding by Vision-Language Models

WildDoc is a benchmark dataset created by ByteDance to evaluate the document understanding capabilities of vision-language models in real-world scenarios, as described on its project homepage. The dataset was last updated on May 19, 2025.

Tags: Multimodal, Document Understanding, Vision Language Models, Evaluation Benchmark, Real World Documents, +1
