DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,947 datasets

Multimodal Best: AI Training Data Collection

A dataset titled 'multimodal-best' published on Kaggle. The dataset's specific content, size, and creation details are not provided in the available metadata. Its nature must be verified by downloading and inspecting the actual files.

Tags: Multimodal, Machine Learning, AI Training, +1 more

Chemical Reaction Visual Question Answering Benchmark With 1,525 Questions

RxnBench (SF-QA) is a visual question answering benchmark containing 1,525 PhD-level multiple-choice questions on organic chemistry. The benchmark is built from 305 scientific figures drawn from high-impact open-access journals, with domain experts designing five questions per figure.

Tags: Parquet, Size Categories: 1K<n<10K, arXiv: 2411.11098, Library: polars, Language: zh, Language: en, Task Categories: visual-question-answering, License: cc-by-nc-sa-4.0, Modality: text, arXiv: 2512.23565, Library: mlcroissant, Modality: image, Library: datasets, Library: pandas, Chemistry, Region: us, +1 more

Two-Box Judge GUI: A Multimodal Dataset for GUI Element Selection Models

Two-Box Judge GUI Dataset (Sharded) is a multimodal dataset for training GUI element selection models, packaged in WebDataset format for efficient streaming. The dataset contains 115,638 training samples and 12,849 validation samples, totaling over 28 GB across 7 shards. It was created by author Micasa997 and last updated on February 4, 2026.

Tags: Multimodal, WebDataset Format, Multimodal Training, Computer Vision, GUI Element Selection, +1 more

LLaVA-LoRA-nil-final-weights: Fine-Tuned Vision-Language Model Weights

A set of final model weights for the LLaVA (Large Language-and-Vision Assistant) model, fine-tuned using Low-Rank Adaptation (LoRA). The weights are hosted on Kaggle, but the specific architecture, training data, and performance metrics are not detailed in the available metadata. The dataset's author, organization, and last update date are unknown.

Tags: Multimodal, Fine Tuning Weights, Vision Language Model, Large Language Model, +1 more

eMotions: Large-Scale Emotion Analysis for Short-Form Videos

Wu et al. introduced the eMotions dataset in 2025 as a large-scale resource for emotion analysis in short-form videos, accompanying the ACM ICMR'25 paper 'Towards Emotion Analysis in Short-form Videos.' The available metadata lists only a text modality.

Tags: JSON, Library: polars, Size Categories: n<1K, Modality: text, Library: mlcroissant, Library: datasets, Library: pandas, Region: us, License: apache-2.0, +1 more

Multimodal Chips V3: Data on Computer Hardware

A dataset titled 'multimodal-chips-v3' hosted on Kaggle. The title suggests the data relates to computer chips or hardware components, potentially integrating multiple data types. No further metadata, such as author, size, or description, is provided.

Tags: Multimodal, Electronics, Hardware, Computer Chips, +1 more

MathNet: A Global Multimodal Benchmark for Mathematical Reasoning

MathNet is a global multimodal benchmark for evaluating mathematical reasoning and retrieval, presented at ICLR 2026. This listing points to its official implementation repository, which was created by ShadeAlsha and last updated on April 21, 2026.

Tags: Multimodal, Mathematical Reasoning, AI Evaluation, Benchmark, Retrieval, Multimodal Benchmark, +1 more

Animation Character Design Dataset for Multimodal AI Emotion Analysis

Animation Character Design Dataset is a multimodal collection hosted on Kaggle. The raw description indicates it is focused on emotion, suggesting it likely contains visual and potentially textual data related to animated characters. Metadata is minimal; actual content requires verification after download.

Tags: Multimodal, Multimodal AI, Animation, Emotion, Character Design, +1 more

BrowseComp-V3: A Benchmark for Multimodal Browsing Agents with 300 Samples

BrowseComp-V3 is a benchmark dataset containing 300 samples for evaluating multimodal browsing agents. It includes encrypted question-answer pairs, images, search trajectories, and sub-goals. The dataset was created by Halcyon-Zhang and last updated on February 13 (year not listed).

Tags: Multimodal, JSON, Library: polars, Encrypted Data, Size Categories: n<1K, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Benchmark, Library: pandas, Region: us, Multimodal Benchmark, Browsing Agents, Search Trajectories, +1 more

Gretel Safety Alignment Dataset with 8,361 Prompt-Response Triplets

The dataset contains 8,361 curated triplets, each consisting of a prompt, a response, and a safe response, across various risk categories. It also includes safety scores, judge reasoning, and harm probability assessments. It was created by Gretel.ai and is available under the Apache License 2.0.

Tags: Text, Machine Learning, AI Safety, Language Model Alignment, Natural Language Processing, Prompt Response Triplets, Synthetic Data, Synthetic, +1 more

PMC-VQA: Medical Visual Question Answering Dataset

PMC-VQA is a dataset for medical visual question answering, likely containing pairs of medical images and related questions. It is hosted on Kaggle, but details such as the creator, size, and specific contents are not provided in the metadata. Its purpose appears to be training and evaluating AI models on medical image-text understanding tasks.

Tags: Multimodal, Medical Imaging, Vision Language, Multimodal AI, Medical VQA, +1 more

D.Html: Document Images with Structured HTML and Markdown Markup

D.Html contains fewer than 1,000 document page images paired with structured HTML and Markdown markup for OCR and reconstruction tasks. Developed by prithivMLmods and updated in March 2026, the collection focuses on preserving document hierarchies like headings and paragraphs.

Tags: OPTIMIZED-PARQUET, Parquet, Dynamic HTML, Task Categories: image-text-to-text, Library: polars, DOI: 10.57967/hf/7967, Task Categories: image-to-text, Language: en, Size Categories: n<1K, Modality: text, Code, Library: mlcroissant, HTML, Modality: image, Library: datasets, Library: pandas, Region: us, OCR, License: apache-2.0, +1 more

MM-Corel-10K: Multimodal Image-Text Dataset for CBIR

A multimodal dataset containing images and associated text, likely for Content-Based Image Retrieval (CBIR) research. It is hosted on Kaggle, but specific details like size, author, and update date are not provided in the available metadata. The dataset's content and structure require verification after download.

Tags: Multimodal, Corel 10K, Image Text, Computer Vision, CBIR, +1 more

MM-GHIM-10K: A Multimodal Image-Text Dataset for Content-Based Image Retrieval

MM-GHIM-10K is a multimodal dataset containing paired image and text data, intended for Content-Based Image Retrieval (CBIR) research. The dataset is published on Kaggle, but its specific size, creation date, and authorship are not detailed in the provided metadata. Its content likely consists of 10,000 items, as suggested by the '10K' in its title, though this requires verification.

Tags: Multimodal, Image Text, Computer Vision, CBIR, +1 more

Libero90 VLM Features: Vision-Language Model Feature Set

Libero90 VLM Features is a dataset uploaded to HuggingFace by user 'arif101'. The dataset's title suggests it contains extracted features for vision-language model tasks, likely related to the LIBERO benchmark. The dataset was last updated on April 12, 2026.

Tags: Multimodal, Machine Learning, Vision Language Model, Multimodal Features, +1 more

EVP Multimodal: Data from Microsoft and SpaceX

EVP Multimodal at Microsoft and SpaceX is a dataset hosted on Kaggle. The dataset's title suggests it contains multimodal data, likely combining image, text, or other data types, from the two named organizations. Specific details on content, size, and collection methods are unavailable from the provided metadata.

Tags: Multimodal, Microsoft, Satellite Imagery, Multimodal AI, Computer Vision, SpaceX, +1 more

Multimodal Crypto Features v5

Multimodal Crypto Features v5 is a dataset hosted on Kaggle. Its title suggests it contains multiple types of data features related to cryptocurrencies. The specific content, scale, and origin are not detailed in the available metadata.

Tags: Multimodal, Cryptocurrency, Multimodal Features, Finance, Financial Data, +1 more

Zebra-CoT: 182,384 Interleaved Vision-Language Reasoning Traces

Zebra-CoT is a large-scale dataset containing 182,384 samples of logically coherent interleaved text and image reasoning traces. It was created by multimodal-reasoning-lab and covers four major categories: scientific reasoning, 2D visual reasoning, 3D visual reasoning, and visual logic & strategic games. The dataset was last updated on Hugging Face in January 2026.

Tags: Multimodal, Vision Language Reasoning, Computer Vision, Multimodal Reasoning, Large Scale, Logic Games, Scientific Reasoning, Visual Reasoning, +1 more

PanoEnv-QA: A Large-Scale Panoramic Visual Question Answering Benchmark

The PanoEnv-QA benchmark includes over 14.8K questions designed to probe 3D spatial intelligence on panoramic images. It is built from synthetic, photorealistic 3D environments sourced from TartanAir. The dataset was created by author 7zkk and was last updated on February 24, 2026.

Tags: Multimodal, Panoramic Imagery, 3D Spatial Intelligence, Benchmark, Large Scale, Visual Question Answering, Synthetic Environments, Synthetic, +1 more

MicroLens VQA: 93,014 Microscopy Image-Question-Answer Triples

MicroLens VQA provides 93,014 triples of microscopy images paired with questions and answers for fine-tuning vision-language models. The dataset appears to be sourced from Kaggle, but its author, organization, and specific collection methodology are unknown. Its last update date and geographic scope are also unspecified.

Tags: Multimodal, Multimodal AI, Computer Vision, Microscopy, Natural Language Processing, Visual Question Answering, +1 more
