DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,948 datasets

LLaVA Annotations for PASCAL VOC: Visual Question Answering Labels

llava_annotations_pascal_voc is a dataset hosted on Kaggle. The title suggests it contains annotations, likely for images from the PASCAL VOC dataset, generated or used by the LLaVA (Large Language-and-Vision Assistant) model. The dataset's specific content, size, and creation details are not provided in the available metadata.

Tags: Multimodal, Pascal VOC, Computer Vision, Object Detection, Image Annotation (+1 more)

Medical VQA 5 Datasets: Vision-Language Medical Data

A collection of five datasets for Medical Visual Question Answering (VQA). It was published on Hugging Face by MohamedAhmedAE and last updated on March 8, 2026. The datasets likely contain paired medical images and text questions to train and evaluate AI models on medical reasoning tasks.

Tags: Multimodal, Medical Imaging, Vision Language, Question Answering, Healthcare, Medical VQA (+1 more)

Humanity's Last Exam: 2,500 Multi-Modal Frontier Knowledge Questions

Humanity's Last Exam (HLE) is a multi-modal benchmark containing 2,500 questions across dozens of academic subjects, released by the Center for AI Safety and Scale AI in January 2025. It serves as a frontier-level evaluation suite designed to test the limits of human knowledge through closed-ended questions.

Tags: Parquet, Size Categories: 1K<n<10K, Library: polars, Benchmark: official, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Library: pandas, Region: us, License: mit (+1 more)

Med-VQA: Medical Visual Question Answering Dataset

A dataset titled 'med_vqa' hosted on Kaggle. The title suggests it contains medical visual question-answering data, likely pairing medical images with related questions and answers. The dataset's specific scale, origin, and creation date are unknown from the provided metadata.

Tags: Multimodal, Multimodal QA, Vision Language, Medical VQA, Healthcare AI (+1 more)

VQA-RAD: Visual Question Answering for Radiology Images

A dataset for visual question answering (VQA) tasks in the medical domain, specifically focused on radiology images. It was published on the Kaggle platform, but detailed information about its size, creation date, and authors is not provided in the available metadata. The dataset likely contains pairs of medical images and associated textual questions and answers.

Tags: Multimodal, Medical Imaging, Multimodal AI, Radiology, Visual Question Answering (+1 more)

Aligned-8-Emotion: Multimodal Dataset with English and Amharic Text and 88,360 Face Images

Aligned-8-Emotion-Dataset-Final is a multimodal dataset containing 88,360 face images and text in both English and Amharic, annotated for 8 emotion categories. The dataset appears to be sourced from Kaggle, but specific authorship, collection methodology, and temporal details are not provided. Its primary purpose is likely for training and evaluating emotion recognition models across different data modalities and languages.

Tags: Multimodal, Amharic Language, Multimodal Data, Emotion Recognition, English Language, Facial Expression (+1 more)

Kvasir-VQA: Medical Images with Visual Question Answering Labels

A dataset titled 'kvasir-vqa-dataset-images' published on Kaggle. The name suggests it likely contains medical images paired with questions and answers for visual question answering tasks. The dataset's author, organization, size, and specific content are unknown.

Tags: Multimodal, Medical Imaging, Computer Vision, Visual Question Answering (+1 more)

Nigerian Linguistic Alignment Dataset of Creative Stories

A text dataset focused on Nigerian linguistic alignment, published on Kaggle. The raw description suggests it contains creative stories, likely in Nigerian languages or dialects. The author, organization, and specific data characteristics are not provided in the metadata.

Tags: Text, Linguistics, Nigerian Languages, Alignment Data, Text Corpus (+1 more)

Amazon Product Classification with Multimodal Features

Amazon Multimodal Product Classification Dataset is hosted on Kaggle. The dataset title suggests it contains product information from Amazon, likely combining text and image data for classification tasks. Specific details on size, creation date, and authorship are not provided in the available metadata.

Tags: Multimodal, E-Commerce, Product Categorization, Classification, Amazon (+1 more)

blip-itm-v3-checkpoint-v3: Vision-Language Model Checkpoint

Kaggle hosts the blip-itm-v3-checkpoint-v3, a model checkpoint for the BLIP (Bootstrapping Language-Image Pre-training) architecture. The checkpoint likely contains parameters for image-text matching tasks, enabling vision-language model fine-tuning. Its specific training data, size, and performance metrics are not detailed in the provided metadata.

Tags: Multimodal, BLIP Model, Vision Language, Computer Vision, Image Captioning (+1 more)

LLaVA Annotations for MS COCO: Vision-Language Grounding Data

Annotations likely linking images to text, created for the LLaVA (Large Language-and-Vision Assistant) project. The dataset is hosted on Kaggle, but its specific size, structure, and creation details are not provided in the available metadata. The content appears to be derived from or related to the MS COCO (Common Objects in Context) image dataset.

Tags: Multimodal, Vision Language, LLaVA, Multimodal Annotations, Image Captioning (+1 more)

Solar-Icicles-Multimodal-V1: A Benchmark for Video Architecture Comparison

Solar-Icicles-Multimodal-V1 is a dataset described as 'The Night Crew Benchmark: A Comparative Study of 7 SOTA Video Architectures'. It is hosted on Kaggle. The dataset's author, organization, size, and specific contents are not detailed in the provided metadata.

Tags: Video, Multimodal, Benchmark, Computer Vision, Video Architectures, Deep Learning (+1 more)

Aligned 8 Emotion Dataset: Multimodal English and Amharic Text with Face Images

A multimodal dataset containing 88,360 face images and text in English and Amharic, annotated for 8 emotion categories. It is hosted on Kaggle and intended for sentiment and emotion analysis tasks. The author, organization, and specific collection details are not provided.

Tags: Text, Multimodal, Amharic Text, Sentiment Analysis, Emotion Classification, Natural Language Processing, Facial Expression, Text Data (+1 more)

KORE-74K Image Recognition and Captioning Dataset

KORE-74K is a multimodal dataset containing over 74,000 training entries for image recognition, captioning, and visual question answering tasks. It was created by author kailinjiang and published in 2026, building upon the MMEVOKE dataset. The data includes separate archives for recognition/caption images and VQA images, paired with structured JSON annotations.

Tags: Multimodal, Task Categories: image-text-to-text, Multimodal AI, arXiv:2510.19316, Computer Vision, Image Captioning, Region: us, Recognition Data, Visual Question Answering (+1 more)

ChartVerse-RL-40K: Challenging Chart Reasoning Samples for Reinforcement Learning

ChartVerse-RL-40K is a curated dataset of the most challenging chart reasoning samples for Reinforcement Learning, developed by opendatalab. It contains samples with the highest failure rates, which strong Vision-Language Models struggle with but can still solve occasionally, providing a strong learning signal for RL training. The dataset was last updated on 2026-01-21.

Tags: Multimodal, Difficult Samples, Vision Language Models, Chart Reasoning, Reinforcement Learning (+1 more)

Unified Prompt Guard: 287,303 Samples for LLM Jailbreak and Harmful Input Detection

ynyg's Unified-Prompt-Guard dataset, last updated January 2026, is a text dataset for training binary classifiers to defend against LLM jailbreak attacks and unsafe prompts. It contains 265,589 training, 10,857 validation, and 10,857 test samples (287,303 in total), drawn from three high-quality sources: jailbreak-detection-dataset, Nemotron-Safety-Guard-Dataset-v3 (zh), and PKU-SafeRLHF.

Tags: Text, Prompt Safety, Text Classification, LLM Security, Harmful Content (+1 more)

WAVLM_Age(VF): Voice Feature Data for Age Prediction

A dataset titled 'WAVLM_Age(VF)' hosted on Kaggle. The title suggests it contains voice features, likely extracted using the WavLM model, for age prediction or analysis. No further metadata, such as sample count, file formats, or author details, is provided.

Tags: Tabular, Audio, Audio Classification, Voice Features, Speech Processing, Age Prediction (+1 more)

RadioML Optimized Dataset with Multimodal Features

RadioML Optimized Multimodal Dataset is a processed version of the RadioML dataset, stored in Zarr format. The dataset appears to be optimized for machine learning workflows and includes multimodal features. The original author, organization, and specific data volume are not provided in the available metadata.

Tags: Multimodal, Machine Learning, Radio Signal, Signal Processing, Wireless Communication (+1 more)

VinDr-CXR-VQA: Chest X-Ray Visual Question Answering with Spatial Grounding

VinDr-CXR-VQA is a large-scale dataset combining 4,394 chest X-ray images with 17,597 natural language question-answer pairs. The dataset, created by faizan711 and last updated in January 2026, is designed for explainable medical AI and includes spatial grounding annotations and clinical reasoning explanations. It features six distinct question types to facilitate research in medical visual question answering.

Tags: Multimodal, Medical Imaging, Chest X-Ray, Healthcare, Large Scale, Natural Language Processing, Visual Question Answering, Explainable AI (+1 more)

VIFoodVQA: Visual Question Answering Dataset for Food Images

vifoodvqa is a dataset published on Kaggle. The title suggests it is a Visual Question Answering (VQA) dataset focused on food images. The dataset's specific content, size, and origin require verification after download due to minimal provided metadata.

Tags: Multimodal, Computer Vision, Natural Language Processing, Food, Visual Question Answering (+1 more)
