DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Multimodal & LLM Datasets | DataSalon

Multimodal & LLM

Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data

1,956 datasets

ViTextVQA: 50,000 Vietnamese Text-Based Visual Question Answer Pairs

ViTextVQA contains over 16,000 images and 50,000 question-answer pairs focused on Vietnamese text comprehension within visual contexts. Developed by researcher minhquan6203 and documented in arXiv paper 2404.10652, it serves as a benchmark for text-based visual question answering in Vietnamese.

License: CC BY-NC 3.0, Region: US, arXiv: 2404.10652

EyeVLM: A Vision-Language Model Dataset

EyeVLM Dataset is a collection of data likely designed for training or evaluating vision-language models. The dataset is hosted on Kaggle, but its specific contents, size, and creation details are not provided in the available metadata. Further verification is required to confirm the exact nature and scope of the included data.

Multimodal, Vision Language, Image Text, AI Training

SmolVLM_vigor_annotations: Vision-Language Model Evaluation Data

Kaggle hosts the SmolVLM_vigor_annotations dataset. The title suggests it contains annotations for evaluating vision-language models, likely on tasks like visual grounding or reasoning. The dataset's specific content, size, and origin require verification after download.

Multimodal, Vision Language Model, Annotations, Multimodal AI, Computer Vision

Ecological Environment Interaction Data Collection

This multimodal dataset collects ecological environment interaction data. The author, organization, and data volume are not specified, and the last update date is unknown.

Geospatial, Multimodal, Environmental Science, Multimodal Data, Ecological Environment

RGB and Thermal Traces for Dynamic Event Reconstruction

This multimodal dataset contains paired RGB and thermal image traces for reconstructing events in dynamic environments. It is designed for research in multimodal sensor fusion and computer vision. Information on the creator, size, and temporal coverage is not provided in the available metadata.

Image, Multimodal, Multimodal Event Reconstruction, Computer Vision, Sensor Fusion

OLIMP: A Heterogeneous Multimodal Dataset for Environment Perception

OLIMP is a heterogeneous multimodal dataset designed for advanced environment perception tasks. The dataset likely contains multiple data types, such as images, video, or sensor readings, integrated for perception modeling. Its author, organization, and specific size are not provided in the metadata.

Multimodal, Multimodal Data, Computer Vision, Environment Perception, Sensor Fusion

Remote Sensing Vision-Language Model Fine-Tuning Data

RSCoVLM is a dataset for co-training vision-language models on remote sensing imagery for multi-task learning. The dataset is associated with a published academic paper and was created by a team of researchers including Qingyun Li, Shuran Ma, and Junwei Luo. It was last updated in January 2026.

Geospatial, Language: en, Aerial, Modality: image, License: CC BY 4.0, Region: US, Geoscience, arXiv: 2511.21272

AgentNet: 22.6K Human-Annotated Computer-Use Tasks Across Three OSs

AgentNet contains 22.6K human-annotated computer-use trajectories across Windows, macOS, and Ubuntu operating systems. Developed by xlangai and released in early 2026, it serves as a foundation for training vision-language-action (VLA) models for desktop automation.

Task Categories: image-text-to-text, Computer Use, arXiv: 2508.09123, Language: en, Region: US, Agent, License: MIT

food-health-vlm-checkpoint

Food-Health-VLM-Checkpoint is a dataset published on Kaggle. The title suggests it contains a checkpoint for a Vision-Language Model (VLM) trained on food and health-related data. Specific details regarding data volume, creation method, and authorship are not provided in the available metadata.

Multimodal, Vision Language Model, Multimodal AI, Healthcare, Checkpoint, Food Health

food-health-vlm-checkpoint-best-1

Food-health-vlm-checkpoint-best-1 is a model checkpoint likely containing trained weights for a Vision-Language Model (VLM). The dataset is hosted on Kaggle, but its specific contents, creation date, and author are not detailed in the provided metadata. Its title suggests a focus on applying multimodal AI to topics at the intersection of food and health.

Multimodal, Machine Learning, Vision Language Model, Healthcare, Checkpoint, Food Health

WAVLM_Region(VF): Audio Feature Representations

A dataset titled 'WAVLM_Region(VF)' is hosted on Kaggle. The title suggests it contains audio feature representations, likely derived from the WAVLM model. No further metadata on size, format, or creation details is available.

Audio, Machine Learning, Speech Processing, Audio Representation

WAVLM_Gender(VF): Speech Audio for Gender Classification

Kaggle hosts the WAVLM_Gender(VF) dataset. The title suggests it contains speech audio data likely intended for gender classification tasks. Specific details on volume, creator, and creation date are unavailable.

Audio, Audio Features, Speech Processing

Tamil Cultural Multimodal Backdoor Dataset

This multimodal dataset relates to Tamil culture and backdoor attacks in machine learning. It is hosted on Kaggle, but its contents, size, origin, collection method, and temporal scope are not provided in the available metadata.

Multimodal, Cultural Datasets, Multimodal Data, Tamil Culture, Machine Learning Security

Multimodal Emotion Dataset for Affective Computing Research

This dataset focuses on emotion recognition and likely contains multiple data modalities such as text, audio, or images. It is hosted on Kaggle. The collection method, author, and temporal coverage are not detailed in the available metadata.

Multimodal, Affective Computing, Multimodal Data, Emotion Recognition

EgoGazeVQA: Egocentric Gaze-Guided Video Question Answering Benchmark

EgoGazeVQA is an egocentric gaze-guided video question answering benchmark introduced in the paper 'In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting'. The dataset, created by author taiyi09, leverages gaze information to improve the understanding of daily-life videos. It was last updated on 2026-01-22.

Video, Multimodal, Benchmark, Gaze Tracking, Egocentric Vision, Multimodal Benchmark, Video Question Answering

OctoCodingBench: Benchmark for Scaffold-Aware Instruction Following in Coding Agents

OctoCodingBench is a benchmark for evaluating scaffold-aware instruction following in repository-grounded agentic coding. It was created by MiniMaxAI and last updated on Hugging Face on January 13, 2026. The benchmark focuses on whether coding agents follow rules while solving tasks, a dimension not covered by existing task-completion benchmarks.

Text, Software Engineering, Benchmark, Coding Benchmark, Agentic Coding

VIGOR-LLaVA-CAS-Baselines: Multimodal AI Benchmark Results

This Kaggle dataset contains baseline results for the VIGOR-LLaVA-CAS benchmark. It likely includes performance metrics and model outputs for evaluating vision-language models. Its size, update date, and author are unknown.

Multimodal, Vision Language, LLaVA, AI Benchmark, Multimodal Baselines

Nemotron-RL: Instruction Following with Structured JSON Outputs

NVIDIA's Nemotron-RL-instruction_following-structured_outputs dataset tests a model's ability to follow output formatting instructions under JSON schema constraints. Each problem consists of a document, an output formatting instruction (schema), and a question, with difficulty varied by instruction location, comprehensiveness, and schema complexity. The dataset was last updated on January 12, 2026.

Text, JSON Schema, NLP Evaluation, Structured Output, Model Benchmark

TRACE: A Multimodal Dataset for AI Research

TRACE is a multimodal dataset published on Kaggle. The dataset's specific content, size, and creation details are not provided in the available metadata. Further verification is required to determine its exact composition, scale, and origin.

Multimodal, Machine Learning, AI Research, Computer Vision