DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

Droid 1.0.1: 95,617 Franka Robot Episodes and 27M Video Frames

Droid 1.0.1 contains 95,617 robotic episodes and 27,618,651 frames collected using Franka robots. Created by lerobot and updated in July 2025, it documents 49,611 distinct tasks at 15 FPS.

Tags: Parquet, Library: polars, Library: dask, Size Categories: 10M<n<100M, Modality: text, Task Categories: robotics, Modality: tabular, Library: mlcroissant, Library: datasets, Modality: video, Region: us, LeRobot, License: apache-2.0, +1 more
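
The episode, frame, and FPS figures quoted for Droid 1.0.1 imply an average episode length and a total recording time. The sketch below derives both, assuming that "frames" counts timesteps recorded at 15 FPS rather than per-camera images; the function name `episode_stats` is hypothetical and not part of any library.

```python
def episode_stats(episodes: int, frames: int, fps: float) -> dict:
    """Derive average episode length and total duration from catalog figures."""
    if episodes <= 0 or fps <= 0:
        raise ValueError("episodes and fps must be positive")
    frames_per_episode = frames / episodes
    return {
        "frames_per_episode": frames_per_episode,
        "seconds_per_episode": frames_per_episode / fps,
        "total_hours": frames / fps / 3600,
    }


# Figures quoted in the listing (assumed to describe timesteps, not camera images).
stats = episode_stats(episodes=95_617, frames=27_618_651, fps=15)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under that assumption, the figures work out to roughly 289 frames (about 19 seconds) per episode and a little over 500 hours of recorded interaction overall.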

Colpali Train Set: 127,460 Query-Image Pairs for Visual Document Retrieval

This training set, released by vidore in 2024, contains 127,460 query-image pairs for visual document retrieval. It combines 63% academic data from sources like DocVQA with 37% synthetic PDF pages paired with pseudo-questions generated by Claude-3 Sonnet.

Tags: Parquet, Library: polars, Library: dask, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Modality: image, Task Categories: visual-document-retrieval, Library: datasets, Task Categories: document-question-answering, Region: us, arXiv: 2407.01449, +1 more

SynthChartNet: 1.9M Synthetic Charts for Document AI

1,981,157 synthetically generated chart images with ground truth annotations form this multimodal dataset. Created by the docling-project and last updated in July 2025, it is designed for training the SmolDocling model on chart-based document understanding. Charts were rendered at 120 DPI using visualization libraries like Matplotlib, Seaborn, and Pyecharts.

Tags: Multimodal, Multimodal AI, Chart Understanding, Computer Vision, Synthetic Data, Document AI, Synthetic, +1 more

GameQA Text: Game-Code-Driven Reasoning Dataset

This dataset contains text-only reasoning pairs and logic-based question-answer sets synthesized from game code across multiple game environments. It uses game mechanics to support training and evaluation of general reasoning in models via the Code2Logic framework.

Tags: Task Categories: question-answering, Language: en, arXiv: 2505.13886, Region: us, License: mit, +1 more

VLM-150M: A Large-Scale Recaptioned Image-Text Dataset

VLM-150M is a large-scale image-text dataset recaptioned using an SFT-enhanced Qwen2VL model to improve the alignment and detail of textual descriptions. The dataset was created by zhixiangwei and was last updated on July 28, 2025. Its repository is hosted at https://zxwei.site/hqclip/.

Tags: Multimodal, Recaptioning, Vision Language, Image Text, Multimodal Training, Computer Vision, Large Scale, +1 more

High Quality Midjourney Srefs: 1,000+ AI Images with Moondream Captions

This collection of style references and automated captions contains between 1,000 and 10,000 AI-generated images from midjourneysref.com. Created by peteromallet and updated in July 2025, the records are stored in Parquet format and smart-cropped for machine learning use.

Tags: Parquet, Size Categories: 1K<n<10K, Library: polars, Library: dask, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Region: us, +1 more

Synthetic Identity Documents for Vision Language Model Training

This dataset contains 15,110 high-quality synthetic identity documents designed for fine-tuning Vision Language Models. Created by sugiv, it includes realistic driver's licenses and credit cards with diverse variations in design, layout, and content. It was last updated on July 20, 2025.

Tags: Image, Multimodal, Size Categories: 10K<n<100K, Task Categories: image-text-to-text, Task Categories: object-detection, Task Categories: image-to-text, Language: en, Task Categories: visual-question-answering, VLM Training, Identity Documents, Task Categories: document-question-answering, Computer Vision, Credit Cards, License: cc-by-nc-4.0, Region: us, Driver License, OCR, Synthetic Data, Document AI, Synthetic, +1 more

Expert Medical Question-Answering Benchmark With Multimodal Tasks

MedXpertQA is a benchmark dataset containing 4,460 questions for evaluating expert-level medical knowledge and reasoning. It was created by TsinghuaC3I and features both text-based and multimodal tasks that integrate structured clinical data with images. The dataset was last updated in July 2025.

Tags: Text, Multimodal, Medical QA, Expert Knowledge, Benchmark, Healthcare, Computer Vision, Clinical Reasoning, Multimodal Benchmark, +1 more

Git-10M: A Global-Scale Remote Sensing Dataset with Over 10 Million Image-Text Pairs

Over 10 million image-text pairs constitute this global-scale remote sensing dataset, which also includes geographical location and resolution information. The dataset was authored by 'lcybuaa' and was last updated on the Hugging Face platform in June 2025.

Tags: Image, Geospatial, Multimodal, Image Text Pairs, Satellite Imagery, Computer Vision, Large Scale, +1 more

ShareGPT-4o-Image: 91K AI-Generated Images for Multimodal Model Alignment

FreedomIntelligence released a dataset of 91,000 images generated by GPT-4o's image capabilities. The collection includes 45,000 text-to-image and 46,000 text-and-image-to-image samples, designed to align open multimodal models with GPT-4o's visual content creation strengths. The dataset was last updated on July 1, 2025.

Tags: Multimodal, GPT-4o, Multimodal AI, Computer Vision, Image Generation, Large Scale, Synthetic Data, +1 more

Dynvqa: A Multimodal Vision-Language Question Answering Dataset

Dynvqa is a multimodal dataset hosted on Hugging Face, authored by xandery and last updated on 2025-08-24. The dataset likely contains image-text pairs for question answering tasks, as suggested by its platform tags. The specific number of samples, column structure, and data collection methodology are not detailed in the available metadata.

Tags: Image, Multimodal, Parquet, Text, Library: polars, Hugging Face, Size Categories: n<1K, Modality: text, Library: mlcroissant, Vision Language, Modality: image, Image Text, Library: datasets, Library: pandas, Question Answering, Region: us, License: mit, +1 more

INS-MMBench: A Multimodal Benchmark for Insurance AI

INS-MMBench is the first comprehensive benchmark for evaluating Large Vision-Language Models in the insurance domain. It covers four insurance types—auto, property, health, and agricultural—and key insurance stages. The dataset was created by FDU-INS and was last updated on Hugging Face in July 2025.

Tags: Multimodal, Size Categories: 10K<n<100K, Task Categories: question-answering, Language: en, arXiv: 2406.09105, Benchmark, Question Answering, Healthcare, Insurance, Region: us, Finance, License: apache-2.0, Multimodal Benchmark, +1 more

GEOBench-VLM: A Benchmark for Vision-Language Models on Geospatial Tasks

This benchmark dataset is designed to evaluate Vision-Language Models on tasks specific to geospatial applications. It was created by aialliance and last updated on June 30, 2025. The dataset addresses the unique complexities of geospatial data not covered by generic VLM benchmarks.

Tags: Geospatial, Multimodal, AI Evaluation, Vision Language Models, Benchmark, Computer Vision, Geospatial Analysis, +1 more

Claude Code Documentation for LLM Training and RAG Systems

This dataset contains 29 pages of documentation for Anthropic's Claude Code, crawled on 2025-06-24. The text totals 27,764 words formatted into 29 chunks. It was prepared by author 'ratanon' for use in LLM training and RAG systems.

Tags: Text, Code Assistance, RAG Systems, LLM Training, Documentation, +1 more

MAmmoTH-VL-Instruct-12M: Multimodal Instruction Examples for Vision-Language Pre-training

MAmmoTH-VL-Instruct-12M is a dataset of interleaved multimodal examples adapted for the modality-aware continual pre-training of MoCa models. The dataset, created by moca-embed, was last updated on July 1, 2025. It is structured for visual question answering (VQA) tasks by concatenating prompts and responses.

Tags: Multimodal, Pre-Training, Vision Language, Computer Vision, MoCa Model, Multimodal VQA, Instruction Tuning, +1 more

DocVQA Test Subsampled: 500 Document Image QA Pairs

This is a manually annotated test set of 500 question-answer pairs based on document images. The data originates from the UCSF Industry Documents Library and was curated for benchmarking by subsampling the original DocVQA test set. The dataset was last updated on June 20, 2025.

Tags: Multimodal, Image Text, Benchmark, Question Answering, Document VQA, +1 more

BLIP3o-Pretrain-Long-Caption: 27 Million Images with Long Synthetic Captions

This collection contains 27 million images, each paired with a long caption generated by the Qwen2.5-VL-7B-Instruct model. The dataset was created by the BLIP3o organization and published on Hugging Face in June 2025. It is intended for pretraining vision-language models.

Tags: Multimodal, WebDataset, Multimodal Pretraining, Library: webdataset, Size Categories: 10M<n<100M, Modality: text, Synthetic Captions, Library: mlcroissant, Vision Language, Modality: image, Library: datasets, Image Captioning, Region: us, Large Scale, License: apache-2.0, Synthetic, +1 more

CSVQA

CSVQA is a Chinese multimodal benchmark designed to evaluate the STEM reasoning capabilities of Vision-Language Models. The dataset was created by Skywork and its associated paper was released on arXiv in June 2025. It focuses on scientific visual question answering, combining images with text in Chinese.

Tags: Multimodal, CSV, Size Categories: 1K<n<10K, Task Categories: image-text-to-text, Library: polars, Task Categories: multiple-choice, Chinese Education, Task Categories: visual-question-answering, Modality: text, Mathematics, Library: mlcroissant, Modality: image, Biology, Library: datasets, Benchmark, Library: pandas, Computer Vision, STEM Education, Chemistry, Region: us, Chinese Language, Physics, Scientific Reasoning, arXiv: 2505.24120, Multimodal Benchmark, Visual Question Answering, +1 more

Sealvqa Gqa: A Visual Question Answering Dataset

Sealvqa Gqa is a dataset hosted on HuggingFace by the author dddraxxx, last updated on August 13, 2025. Its title suggests it is related to visual question answering, likely containing image-question-answer pairs. The specific content, scale, and collection methodology require verification after download.

Tags: Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering, +1 more

HalLoc Token-Level Hallucination Dataset for Vision-Language Models

This dataset contains over 155,000 annotated samples for localizing hallucinations in Vision-Language Models. Created by author uunicee, it spans three tasks and four hallucination types. The dataset was last updated in July 2025.

Tags: Multimodal, arXiv: 2506.10286, Vision Language Models, Hallucination Localization, Size Categories: 100K<n<1M, Computer Vision, Image Captioning, Region: us, Large Scale, VQA, Multimodal Evaluation, +1 more
