DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,948 datasets

Nusantara Herbal-Med: Visual Question Answering for Traditional Medicine

A multimodal dataset likely containing images and text related to traditional herbal medicine from the Nusantara region. The dataset appears designed for Visual Question Answering (VQA) tasks, where models must answer questions about visual content. It is hosted on Kaggle, but detailed metadata such as size, author, and license are currently unknown.

Tags: Multimodal, Herbal Medicine, Medical AI, Traditional Medicine, Visual Question Answering, +1 more

SurveillanceVQA-589K: Visual Question Answering Dataset for Surveillance Video

SurveillanceVQA-589K is a large-scale dataset for visual question answering tasks, likely derived from surveillance footage. The dataset is hosted on Kaggle and appears to be a testing subset of a larger collection. Its specific content, such as the number of video clips or question-answer pairs, requires verification after download.

Tags: Multimodal, Video Analysis, Surveillance, Multimodal AI, Visual Question Answering, +1 more

VQA-Rank8: Visual Question Answering Ranking Dataset

Kaggle hosts the VQA-Rank8 dataset. The title suggests it is likely related to ranking tasks within the domain of visual question answering. No further metadata is available to confirm its specific content, size, or origin.

Tags: Multimodal, Ranking, Multimodal AI, Visual Question Answering, +1 more

Multimodal CSI Text: Wireless Sensing Data with Text Annotations

Multimodal_CSI_Text is a dataset published on Kaggle. The title suggests it contains Channel State Information (CSI) data, a type of wireless signal measurement, paired with text annotations. The dataset's specific content, scale, and collection details are not provided in the available metadata.

Tags: Text, Multimodal, Sensor Data, CSI, +1 more

JEE-SFT: Visual Reasoning and Explanation Dataset for STEM Problems

JEE-SFT is a multimodal instruction-tuning dataset designed to teach Vision Language Models step-by-step reasoning for complex STEM problems. The dataset focuses on the solution process and includes both Multiple Choice Questions and Numerical Value Questions; a key feature is its 'Text-Only Reasoning' filter. It was created by farhananis005 and last updated on February 4, 2026.

Tags: Multimodal, Multimodal Learning, Computer Vision, STEM Education, Instruction Tuning, Visual Reasoning, +1 more

Quran Verse-Level Multimodal Dataset

Quran-MD integrates textual, linguistic, and audio dimensions at the verse (ayah) and word levels. The dataset was created by Buraaq and the associated paper was accepted at the 5th Muslims in ML Workshop at NeurIPS 2025. It is part of a larger, complete Quran-MD collection.

Tags: Audio, Multimodal, Arabic, optimized-parquet, Parquet, library:polars, library:dask, Quran, arxiv:2601.17880, Religious Text, modality:text, size_categories:100K<n<1M, library:mlcroissant, library:datasets, region:us, Linguistics, +1 more

CiteVQA: Visual Question Answering with Citation Grounding

CiteVQA is a dataset published on Kaggle. Its title suggests a focus on visual question answering tasks that require grounding answers in citations or references. The dataset's specific content, size, and origin require verification after download due to minimal provided metadata.

Tags: Multimodal, Multimodal AI, Citation analysis, Visual Question Answering, +1 more

DiverseVQA: A Visual Question Answering Dataset

DiverseVQA is a dataset likely designed for visual question answering tasks, which involve answering natural language questions about images. It is hosted on the Kaggle platform, but detailed metadata such as the number of samples, specific image sources, and creation date are not provided. The dataset's content and scale require verification after download.

Tags: Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering, +1 more

FRIEDA: 500 Multimodal Examples for Open-Ended Cartographic Reasoning

FRIEDA consists of 500 multimodal examples for open-ended cartographic reasoning, developed by knowledge-computing and released in late 2025. The benchmark pairs real-world map images with natural-language questions and reference answers to evaluate spatial reasoning capabilities.

Tags: optimized-parquet, Parquet, library:polars, language:en, task_categories:visual-question-answering, size_categories:n<1K, modality:text, library:mlcroissant, arxiv:2512.08016, modality:image, library:datasets, library:pandas, region:us, +1 more

Sarthi Agridata: 220,222 Hindi Agri-Advisory Synthetic Examples

The dataset contains 220,222 synthetic data points for agricultural advisory tasks, generated by Google Gemini 2.5 Flash. It is designed for instruction tuning and chain-of-thought reasoning, with Hindi as the target output language and English used for internal reasoning and metadata. Soketlabs published the dataset on Hugging Face, with a last update recorded on January 16, 2026.

Tags: Text, Hindi, Chain Of Thought, Agriculture, Instruction Tuning, Synthetic Data, Synthetic, +1 more

VQA Data: Visual Question Answering Dataset

A dataset for Visual Question Answering tasks, published on Kaggle. The dataset likely contains paired images and text questions with corresponding answers. Specific details on size, author, and last update are unknown.

Tags: Multimodal, Computer Vision, Natural Language Processing, Visual Question Answering, +1 more

vqa-cv-rank8-32: Visual Question Answering Ranking Data

A dataset named 'vqa-cv-rank8-32' published on Kaggle. Its title suggests a connection to Visual Question Answering (VQA) and ranking tasks, likely containing image-text pairs with ranking labels. The dataset's author, organization, size, and specific contents are unknown.

Tags: Multimodal, Machine Learning, Ranking, Computer Vision, Visual Question Answering, +1 more

PathVQA-Turkish-Text: Turkish Visual Question Answering for Medical Pathology

PathVQA-Turkish-Text is a dataset published on Kaggle. The title and platform tags suggest it likely contains Turkish-language text data associated with medical imagery for visual question answering tasks. The dataset's specific content, size, and provenance require verification after download.

Tags: Multimodal, Turkish Language, Multimodal QA, Medical Imagery, Visual Question Answering, +1 more

VisCoT VStar Collage: Multimodal QA Data for Visual Search Training

m-Just's dataset comprises collages with a randomly placed 'core' image and a corresponding question-answer pair. This data was used to train the vSearcher model introduced in the research paper 'InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search'. The dataset was last updated on Hugging Face in January 2026.

Tags: Multimodal, Visual Search, Multimodal QA, Foundation Models, Image Collage, Computer Vision, +1 more

OpenThoughts-Agent-v1-RL: 720 Tasks with Verifiers for Agentic RL

OpenThoughts-Agent-v1-RL provides approximately 720 curated reinforcement learning tasks designed for training agentic models, released by the open-thoughts project in January 2026. The collection includes instructions, environment configurations, and verifiers specifically optimized for benchmarks like Terminal-Bench 2.0 and SWE-Bench.

Tags: Parquet, library:polars, size_categories:n<1K, modality:text, library:mlcroissant, library:datasets, library:pandas, region:us, +1 more

InspecSafe-V1: Multimodal Robot Inspection Data from Industrial Sites

InspecSafe-V1 is a high-quality, multimodal annotated dataset for world model construction in industrial environments. The data was collected from real-world inspection robots deployed across industrial sites and has been cleaned and standardized. The dataset covers five representative industrial settings, including tunnels and power facilities.

Tags: Multimodal, source_datasets:original, language:en, modality:image, World Modeling, Robotics, Industrial Scene, Inspection, region:us, Multiple Modality, license:mit, Custom Dataset, +1 more

SpiderMass MS and MSMS Spectra for Ovarian Cancer Typing

Raw mass spectrometry (MS) and tandem mass spectrometry (MSMS) spectra used for ex vivo ovarian cancer typing and immunoscoring. Developed by Léa Ledoux and hosted on Harvard Dataverse, the data supports surgical decision-making through multimodal machine learning. The collection was last updated in March 2026.

Tags: Chemistry, Medicine Health And Life Sciences, +1 more

SignVLM1: A Vision-Language Model Dataset for Sign Language

A dataset named SignVLM1, published on Kaggle. The title suggests it is likely related to sign language and vision-language models. Metadata is minimal; actual content requires verification after download.

Tags: Multimodal, Vision Language Model, Sign Language, +1 more

Structured3D: Panorama Dataset with BLIP3 Text Captions

Structured3D is a dataset of panoramic indoor scene images paired with text captions generated by the BLIP3 model. The dataset was created by KevinHuang and was last updated on February 5, 2026. The description notes missing caption files for several specific scene paths, indicating potential data completeness issues.

Tags: Multimodal, Computer Vision, Image Captioning, Scene Understanding, Panorama Images, +1 more

Baseline-VQA-CV: A Visual Question Answering Benchmark Dataset

A dataset titled 'baseline-vqa-cv' is hosted on Kaggle. The dataset likely contains image-text pairs for visual question answering tasks, a common benchmark in computer vision and AI. Its specific content, scale, and authorship require verification after download.

Tags: Multimodal, Multimodal AI, Benchmark, Computer Vision, Visual Question Answering, +1 more
