DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

COCO-QA Vietnamese: 117,684 Visual Question-Answer Pairs

COCO-QA Vietnamese is a fully translated Vietnamese version of the popular COCO-QA dataset for Visual Question Answering (VQA) tasks. It contains 117,684 image-based question-answer pairs translated into Vietnamese, with answers limited to one word. The dataset was created by ThucPD and last updated on June 8, 2025.

Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Vietnamese Language, Visual Question Answering, +1 more

BigDocs-7.5M: Permissively-Licensed Multimodal Training Data for Documents and Code

BigDocs-7.5M is a dataset created by ServiceNow for training multimodal models on document and code tasks, as described in the associated arXiv paper. The dataset was last updated on June 20, 2025, and is hosted on Hugging Face. It appears to contain both text and image data. Some parts are distributed with an image identifier column, and reconstructing them requires a provided script.

Multimodal, Parquet, Library: polars, Library: dask, Training Data, Size Categories: 1M<n<10M, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, License: CC BY 4.0, Permissive License, Computer Vision, arXiv: 2412.04626, Region: US, Multimodal Documents, Text Data, Code Tasks, +1 more

Nemotron Content Safety V2: 33,416 Annotated Human-LLM Interactions

NVIDIA's Nemotron Content Safety Dataset V2 contains 33,416 annotated interactions between humans and LLMs, released in June 2025. It provides structured training, validation, and test splits curated from human preference data to support safety alignment and toxicity detection.

JSON, Size Categories: 10K<n<100K, Safety, Library: polars, Toxicity Detection, Modality: text, Aegis, Library: mlcroissant, Nemoguard, Library: datasets, Library: pandas, Nemotron, Region: US, Content Moderation, Task Categories: text classification, +1 more

VisText: Semantic Chart Captioning Benchmark from MIT

Developed by the MIT Visualization Group (mitvis) and updated in 2025, VisText is a benchmark dataset for chart captioning. It provides paired chart images and captions to evaluate how models interpret visual data representations.

T5, Charts, Captioning Images, Captioning, +1 more

Simple Image Captions Dataset for Multimodal Models

Simple Image Captions provides a collection of image-text pairs for multimodal tasks. The dataset contains fewer than 1,000 entries, as indicated by its size category (n<1K), and was uploaded by user 'uygarkurt' to Hugging Face in August 2025.

Multimodal, CSV, Library: polars, Size Categories: n<1K, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Library: pandas, Computer Vision, Image Captioning, Region: US, Natural Language Processing, +1 more

EEE-Bench: 2,860 Multimodal Electrical Engineering Problems

EEE-Bench is a multimodal benchmark comprising 2,860 problems across 10 electrical and electronics engineering subdomains, including analog circuits and control systems. It was created by afdsafas and last updated on June 23, 2025. The benchmark is designed to evaluate the practical engineering capabilities of large multimodal models using complex visual inputs.

Multimodal, Parquet, Size Categories: 1K<n<10K, Language Creators: expert generated, Library: polars, Task Categories: multiple choice, Task Categories: question answering, arXiv: 2411.01492, Language: en, Task Categories: visual question answering, Language Creators: found, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Benchmark, Library: pandas, Annotations Creators: found, Multiple Choice, Region: US, Reasoning, Electronics Engineering, Electrical Engineering, Multimodal Benchmark, License: MIT, Visual Question Answering, Annotations Creators: expert generated, +1 more

MedGemma LLaVA-Med 10K: Medical Image-Reasoning Pairs

A collection of 10,000 medical image-reasoning pairs converted from the LLaVA-Med dataset into MedGemma's structured reasoning format. The dataset was created by author Manusinhh and last updated on June 29, 2025. Each sample contains an original medical image, step-by-step diagnostic reasoning, a final diagnosis with supporting evidence, and clinically relevant web search terms.

Multimodal, Medical Imaging, Multimodal AI, Healthcare, Computer Vision, Clinical Data, Diagnostic Reasoning, +1 more

Text2CAD: Sequential CAD Designs Generated from Text Prompts

Text2CAD is a dataset for generating sequential computer-aided design (CAD) operations from text prompts. The dataset was created by Mohammad Sadil Khan, Sankalp Sinha, Talha Uddin Sheikh, Didier Stricker, Sk Aziz Ali, and Muhammad Zeshan Afzal. The dataset page was last updated on June 11, 2025.

Time Series, Multimodal, Multimodal Generation, Text To CAD, Computer Aided Design, CAD Design, +1 more

GameQA-5K: 5,000 Synthetic Multimodal Reasoning Samples for Vision-Language Models

GameQA-5K is a dataset of 5,000 training samples extracted from the larger GameQA-140K dataset. It was created by the OpenMOSS-Team and published on Hugging Face in June 2025 for use in training models via the GRPO method. The data is synthesized from game code to enhance multimodal reasoning in vision-language models.

Multimodal, Task Categories: image text to text, Task Categories: question answering, Language: en, Vision Language Models, Game Code, Modality: image, Benchmark, arXiv: 2505.13886, Region: US, Multimodal Reasoning, Synthetic Data, License: MIT, Visual Question Answering, +1 more

ARKitScenes-SpatialLM: 5,047 Indoor Scenes for Oriented Object Detection

5,047 real-world indoor scenes captured using Apple's ARKit framework, preprocessed for SpatialLM training. The dataset is formatted for oriented object bounding box detection with large language models. It was created by Gen3DF and last updated on June 30, 2025.

Point Cloud, Multimodal, JSON, Size Categories: 1K<n<10K, Modality: 3D, Library: polars, License: Apple AMLR, Modality: text, Library: mlcroissant, Library: datasets, Library: pandas, Computer Vision, Object Detection, Region: US, Augmented Reality, Indoor Scenes, +1 more

Multi Domain VQA 20K: A Visual Question Answering Benchmark for VLMs

20,000 samples combine questions and images from three established VQA datasets: AOKVQA, Path-VQA, and TDIUC. This medium-sized benchmark is designed to test the multi-domain knowledge of vision-language models. It was created by dutta18 for educational and research purposes, with copyright retained by the original dataset owners.

Multimodal, Parquet, Size Categories: 10K<n<100K, Library: polars, Knowledge Testing, Library: dask, Language: en, Task Categories: visual question answering, Modality: text, Library: mlcroissant, Modality: image, Educational Research, Library: datasets, Region: US, License: Apache 2.0, Multimodal Benchmark, Visual Question Answering, Medical, +1 more

Web Screenshots with Annotated Instructions and Click Targets

1,639 English-language web screenshots from over 100 websites are paired with natural-language instructions and pixel-level click targets. The dataset provides a high-quality benchmark for evaluating multimodal navigation models, created by Hcompany and released in June 2025.

Multimodal, Benchmark, Web Navigation, Multimodal Benchmark, Human Computer Interaction, +1 more

Ring-lite-rl-data: Reinforcement Learning Dataset for Math and Code

47,400 curated problems spanning mathematics and programming, formatted specifically for reinforcement learning. The collection includes 39,000 math problems from sources like AoPS and DeepMath-103K, alongside approximately 8,400 coding challenges.

JSON, Size Categories: 10K<n<100K, Task Categories: text generation, Language: zh, Library: dask, Language: en, Modality: text, Code, Library: mlcroissant, Library: datasets, Region: US, arXiv: 2506.14731, Math, License: Apache 2.0, +1 more

Chart-To-Code Generation Benchmark For Multimodal Models

ChartMimic evaluates visually-grounded code generation in large multimodal models using information-intensive visual charts. The dataset was created by the ChartMimic team and was last updated in June 2025.

Multimodal, Parquet, Size Categories: 1K<n<10K, Task Categories: text generation, Library: polars, Task Categories: image to text, Library: dask, Language: en, Modality: text, Library: mlcroissant, Large Multimodal Models, Modality: image, Library: datasets, Chart To Code, Code Generation, Region: US, arXiv: 2406.09961, Task Categories: image to image, Large Language Models, License: Apache 2.0, Multimodal Evaluation, +1 more

MVRB: Massive Visualized Information Retrieval Benchmark

This Visualized Information Retrieval (Vis-IR) benchmark is built around four meta-task categories: Screenshot Retrieval (SR), Composed Screenshot Retrieval (CSR), Screenshot QA (SQA), and Open-Vocabulary. The dataset uses digital screenshots to unify search and information extraction tasks across diverse application scenarios.

Language: en, Modality: image, Region: US, arXiv: 2502.11431, License: MIT, +1 more

Miriad 5.8M: Medical Instruction and Retrieval QA Pairs from Literature

Miriad 5.8M contains 5.8 million medical question-answer pairs distilled from peer-reviewed biomedical literature using large language models. Released in June 2025 by the Miriad research team, the dataset provides structured data for medical instruction tuning and retrieval-augmented generation. It serves as a large-scale resource for training models on verified scientific knowledge rather than general web content.

Parquet, Library: polars, Library: dask, Size Categories: 1M<n<10M, Modality: text, Library: mlcroissant, Library: datasets, arXiv: 2506.06091, Region: US, +1 more

GPT-4V Generated Vision-Language Instructions

ALLaVA-4V is a multimodal dataset created by FreedomIntelligence using GPT-4V to generate detailed captions and complex reasoning question-answer pairs for images. The dataset incorporates data from sources like LAION and WizardLM, with its generation pipeline and prompts documented on the project page. It was last updated on June 8, 2025.

Multimodal, Vision Language Instruction, QA Generation, Multimodal Training, GPT-4V Generated, +1 more

Awesome LLM and AIGC: Curated Resources for VLM, VLA, and AI4S

Coderonion curated this repository of public projects and datasets focusing on Large Language Models (LLM) and AI Generated Content (AIGC), last updated in August 2025. It aggregates links to specialized domains including Vision Language Action (VLA), AI for Science (AI4S), and specific models like DeepSeek and Qwen3.

Triton, YOLO, Awesome List, AI4Science, AIGC, CUDA, Qwen, Llama, GPT, AI4S, Qwen3, MLLM, Large Language Model, R1, Reinforcement Learning, DeepSeek, VLM, LangChain, VLA, +1 more

Medical Question-Answer Pairs from Biomedical Literature

MIRIAD contains 4.4 million medical question-answer pairs. The pairs were distilled from peer-reviewed biomedical literature using large language models, providing structured data for downstream tasks.

Parquet, Library: polars, Library: dask, Size Categories: 1M<n<10M, Modality: text, Library: mlcroissant, Library: datasets, arXiv: 2506.06091, Region: US, +1 more

HistoPlexer-Ultivue: Multimodal Histological Images for 10 Cancer Samples

A collection of multimodal histological images from the Tumor Profiler Study. It includes whole-slide H&E images, multiplexed immunofluorescence images from Ultivue panels, alignment matrices, exclusion masks, and nuclear segmentation outputs for 10 cancer samples. The dataset was authored by CTPLab-DBE-UniBas and last updated on Hugging Face in June 2025.

Image, Multimodal, CSV, Size Categories: 10K<n<100K, Cancer Research, Library: polars, Library: dask, Language: en, Spatial Proteomics, Modality: tabular, Library: mlcroissant, Modality: image, License: CC BY-SA 4.0, Library: datasets, Computer Vision, Region: US, Pathology, Histology, Multimodal Imaging, +1 more
