Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
RL GSPO Qwen2.5VLM Staged Code V2 is a dataset hosted on Kaggle. The title suggests it relates to reinforcement learning (RL) and staged training for a vision-language model (VLM) named Qwen2.5. The dataset likely contains data used for training or evaluating such models.
An academic dataset from KAIST, William and Mary, University of Alberta, and Auburn University, released in December 2025. It demonstrates a performance gap in state-of-the-art Vision Language Models (VLMs), which perform perfectly on counting tasks with original images but fail catastrophically on modified versions. The dataset is hosted on Hugging Face by author anvo25.
Orient Anything V2 is an enhanced foundation model for unified understanding of object 3D orientation and rotation from single or paired images. This repository contains the final rendering data used for training the model, as provided by author Viglong. The dataset was last updated on January 13, 2026.
VQA_VIDEOS is a dataset hosted on Kaggle. The title suggests it contains video content paired with questions and answers for visual question answering tasks. The dataset's specific size, content details, and origin are not provided in the available metadata.
Multimodal Emotion Dialogue Dataset is a collection of records for analyzing emotional states in conversations. According to its description, it likely contains speech, image, and interaction data. It is hosted on Kaggle, but details on its size, creation, and update history are not provided.
Kaggle hosts this dataset titled 'BlipCaptioningOutput'. The title suggests it contains outputs from the BLIP (Bootstrapping Language-Image Pre-training) model, likely pairing images with generated or ground-truth captions. No further metadata on size, source, or creation date is provided.
SciCap Dataset provides pairs of scientific images with corresponding captions. It is designed for training and evaluating multimodal models. The dataset was created for research in scientific image understanding.
A dataset titled 'wavlm_region_model' is hosted on Kaggle. The dataset likely contains audio feature representations or model outputs from the WavLM architecture. Metadata is minimal; actual content, size, and structure require verification after download.
A Kaggle-hosted dataset titled 'wavlm_gender_model'. The dataset's content likely relates to audio data processed by the WavLM architecture for gender classification tasks. Metadata is minimal; the specific number of samples, audio characteristics, and creation details require verification after download.
A dataset from Kaggle related to reinforcement learning (RL) for the Qwen2.5 Vision-Language Model (VLM). The dataset's title suggests it involves staged code, likely pertaining to training procedures or generated outputs. The specific content, scale, and authorship require verification after download.
SearchVLM is a dataset published on Kaggle. The title suggests it relates to vision-language models, likely containing data for search and retrieval tasks. Specific details on size, creator, and temporal coverage are not provided in the available metadata.
CURATED_VLM_DATASETS_987486 is a dataset collection published on Kaggle. Its title suggests it contains data for training and evaluating Vision-Language Models. The specific contents, size, and origin are not detailed in the provided metadata.
This digital heritage dataset focuses on the preservation of cultural performances and traditions. Its size, author, and last update date are not specified. It is hosted on Kaggle.
Kaggle hosts this dataset titled 'blipcaptionsoutput'. The title suggests it contains image captions generated by the BLIP (Bootstrapping Language-Image Pre-training) model. The dataset's scale, origin, and specific content are not detailed in the provided metadata.
Kaggle hosts the MedVQA-GI-2026 dataset. It is a multimodal dataset for medical visual question answering, specifically focused on gastrointestinal topics. The dataset's author, organization, and specific scale are not provided in the metadata.
Puffin-4M is a large-scale, high-quality dataset containing 4 million samples for camera-centric multimodal understanding and generation. It integrates vision, language, and camera modalities to address the scarcity of benchmarks in spatial multimodal intelligence. The dataset was created by KangLiao and was last updated in January 2026.
Nemotron-RL-instruction_following combines prompts from the WildChat-1M dataset with verifiable instructions from the Open-Instruct code base. Created by NVIDIA, this dataset is designed for training and evaluating models on objective instruction adherence. It was last updated in January 2026.
TAOBAO-MM is a large-scale recommendation dataset derived from user interaction logs on Taobao, one of the world's largest e-commerce platforms. It features historical behavior sequences of up to 1,000 interactions per user and includes high-quality multimodal embeddings. The dataset was authored by TaoBao-MM and was last updated on Hugging Face on January 15, 2026.
ActionDetectionDatasetVLM is a dataset published on Kaggle. Its title suggests it contains video data annotated for action detection tasks, likely intended for training or evaluating vision-language models. The dataset's specific content, size, and origin require verification after download.
Kaggle hosts the synthvision_medical_vqa dataset, which likely contains synthetic medical images paired with questions and answers for visual question answering tasks. The dataset's author, organization, and specific scale are unknown. Its last update date is also unspecified.