DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,947 datasets

SARLANG-1M: 1 Million SAR Image-Text Pairs from 59 Cities

SARLANG-1M is a large-scale benchmark for multimodal synthetic aperture radar image understanding. It comprises more than 1 million high-quality SAR image-text pairs collected from over 59 cities worldwide. The dataset was created by YiminJimmy and was last updated on February 10, 2026.

Geospatial, Multimodal, Synthetic Aperture Radar, Satellite Imagery, Benchmark, Computer Vision, Image Captioning, Large Scale, +1

Zerde QA: 51,000 Kazakh Question-Answer Pairs Across 20+ Domains

Zerde QA contains 51,000 Kazakh question-answer pairs designed for instruction tuning of large language models. The dataset covers more than 20 domains, making it suitable for building general-purpose Kazakh language AI assistants. It is hosted on Kaggle and is formatted for immediate use in fine-tuning tasks.

Text, Question Answering, LLM Fine Tuning, Kazakh Language, +1

Drishti-VLM-Data: Vision-Language Model Training Data

Drishti-VLM-Data is a dataset published on Kaggle. The title suggests it contains data for training or evaluating vision-language models. The dataset's specific content, size, and origin are not detailed in the available metadata.

Multimodal, Vision Language Model, Multimodal Data, Computer Vision, +1

ViFoodVQA: Visual Question Answering Benchmark for Vietnamese Food

ViFoodVQA is a benchmark dataset for visual question answering tasks. The dataset likely contains images of Vietnamese food paired with questions and answers. It is hosted on Kaggle, but details about its size, creation, and update history are unknown.

Multimodal, Benchmark, Visual Question Answering, +1

ANN-2026 Multimodal Challenge: Crowdfunding Outcome Data

A dataset associated with the ANN-2026 multimodal challenge, likely containing information related to crowdfunding campaigns and their outcomes. The dataset is hosted on Kaggle, but its specific contents, scale, and origin are not detailed in the available metadata. Further inspection after download is required to confirm the data's structure and features.

Multimodal, Crowdfunding, Outcome Prediction, Financial Data, +1

Pico Banana SmolVLM Format With Rejected Answer

A balanced image-level tampering detection dataset in SmolVLM-style format. It includes chosen and rejected answer pairs derived from the pico-banana MCQ pipeline, suitable for preference learning and RLHF-style training. The dataset was created by vanloc1808 and was last updated on February 15, 2026.

Multimodal, Preference Learning, Multimodal Dataset, Image Tampering Detection, Computer Vision, RLHF Training, +1

KitaKo: 110k Images with Parallel Captions in English, Filipino, and Taglish

KitaKo Multimodal Dataset contains 110,000 images paired with 548,000 parallel captions. The captions are provided in three languages: English, Filipino, and Taglish. The dataset's author, organization, and last update date are unknown.

Image, Text, Multimodal, Filipino Language, Image Captions, Machine Learning, Taglish, +1

SENTINEL: A Dataset for Mitigating Object Hallucinations in Multimodal LLMs

A dataset associated with the ICCV 2025 paper 'SENTINEL: Mitigating Object Hallucinations via Sentence-Level Early Intervention'. The dataset was created by psp-dada and was last updated on February 11, 2026. It is designed to address the problem of fabricated content in multimodal large language models.

Multimodal, Hallucination Mitigation, Model Evaluation, Multimodal LLM, Computer Vision, Natural Language Processing, +1

BLIP COCO Action Caption Finetuned: Multimodal Image-Text Data

This Kaggle entry is described as a version of the BLIP model fine-tuned on the COCO dataset, and it likely contains image-text pairs for action captioning tasks. Its specific size, columns, and creation details are unknown, so its content and scale require verification after download.

Multimodal, Multimodal AI, Computer Vision, Image Captioning, COCO Dataset, +1

InternVL24B-VLM-CIA: Vision-Language Model Dataset

A dataset titled 'internvl24b-vlm-cia' is hosted on Kaggle. The name suggests it is likely a multimodal dataset for training or evaluating vision-language models. No further metadata is available to confirm its size, origin, or specific content.

Multimodal, Vision Language Model, Multimodal AI, Computer Vision, +1

Resized-Kvsair VQA: A Visual Question Answering Dataset

Resized-Kvsair VQA is a dataset for visual question answering tasks, likely containing pairs of images and corresponding questions. It is hosted on Kaggle, a popular platform for data science competitions and datasets. The dataset's specific content, size, and creation details are not provided in the available metadata.

Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering, +1

VLM Data: Vision-Language Model Training Dataset

VLM Data is a dataset hosted on Hugging Face by author INV-WZQ. The dataset was last updated on April 1, 2026. Its specific content and scale are not detailed in the available metadata.

Multimodal, Vision Language Models, Region: US, +1

Viet OCR VQA Flash2: 137,000 Vietnamese Images with 822,000 Q&A Pairs

This collection pairs 137,000 images containing Vietnamese text with 822,679 synthetic visual question-answering pairs generated by Gemini 1.5 Flash. Created by 5CD-AI and updated in February 2026, it focuses on Vietnamese OCR and scene understanding.

Parquet, Task Categories: text-generation, Vision, Image-Text-to-Text, Library: polars, Task Categories: image-to-text, Library: dask, Task Categories: visual-question-answering, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Modality: image, Library: datasets, Region: US, arXiv: 2408.12480, Language: vi, +1

BLIP Finetuning on Flickr8K Dataset

This Kaggle entry is described as a fine-tuned version of the BLIP model, likely adapted for vision-language tasks. Its specific content and scale are not detailed in the provided metadata. The original Flickr8K dataset is a standard benchmark for image captioning, which suggests this resource may contain model weights or related training data.

Multimodal, Vision Language, Image Captioning, Flickr8k, BLIP, +1

Boss Alignment Dataset: AI Capability Assessment Data

Boss Alignment Dataset is a collection for calibrating expectations of AI capabilities, likely containing examples or feedback. Authored by ChenZiHong-Gavin, it was last updated on GitHub on 2026-04-19. The specific content, scale, and structure require verification after download.

Text, LLM Evaluation, Human Feedback, Boss Alignment, AI Alignment, +1

500 TEST VQA: Visual Question Answering Test Set

500 TEST VQA is a dataset for evaluating visual question answering models. It was published on Kaggle, but its author, organization, and creation date are unknown. The dataset's exact size, format, and annotation details require verification after download.

Multimodal, Multimodal AI, Computer Vision, Natural Language Processing, Visual Question Answering, +1

IoT Relay Fault Monitoring Dataset

Multimodal Smart Grid Condition Records likely capture sensor data related to electrical relay performance. The dataset is hosted on Kaggle, but its specific size, origin, and update history are unspecified. Columns and sample data are unknown, requiring verification after download.

Multimodal, Fault Monitoring, Condition Records, IoT, Smart Grid, +1

EgoBench: A Benchmark for Multimodal Tool-Using Agents

EgoBench is a multimodal interactive benchmark designed for evaluating tool-using agents. The benchmark likely contains tasks requiring agents to process and interact with multiple data modalities. Its specific size, format, and creation details are unknown.

Multimodal, AI Evaluation, Tool Use, Agent Benchmark, Benchmark, +1

VLMNS6: Vision-Language Model Training Data

VLMNS6 is a dataset published on Kaggle, a platform for data science competitions and open data. Its title suggests a focus on vision-language models, which combine computer vision and natural language processing. The dataset's specific content, scale, and origin are not detailed in the available metadata.

Multimodal, Vision Language, Computer Vision, +1

Sequential Movie Preference Dataset for Behavior-Aware Insights

Sequential Movie Preference Dataset is a collection of user behavior data for personalized movie insights, published on Kaggle. The dataset likely contains sequences of user interactions or preferences related to movies. Its specific size, origin, and update history are not detailed in the provided metadata.

Tabular, Time Series, Sequential Data, User Behavior, Movie Recommendation, Preference Modeling, +1
