Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
A multimodal dataset for cardiovascular risk prediction, sourced from Kaggle. It combines ECG images with tabular clinical and biomarker data. The author, organization, size, and temporal coverage are unspecified.
RelNorm Results is a dataset from Kaggle focused on evaluating the understanding of social norms in multimodal AI models. The dataset likely contains test results and performance metrics from experiments assessing how models interpret social contexts across different modalities. The author, organization, and data scale are not specified in the available metadata.
VQAdataset2 is a dataset for visual question answering tasks, published on Kaggle. The dataset likely contains paired images and text questions with corresponding answers. Specific details on size, columns, and creation are not provided in the metadata.
MultiModal Heart Disease Dataset is a dataset published on Kaggle. Its title suggests it contains data related to cardiovascular health, potentially combining different data types. Metadata is minimal; the actual content requires verification after download.
A dataset likely containing files for training or evaluating Vision-Language Models (VLMs) for the Hindi language. It is published on the Kaggle platform. The specific content, scale, and creation details are not provided in the available metadata.
hdrcde is a dataset for computational statistics, focusing on highest density regions and conditional density estimation. It was authored by Rob J. Hyndman and is listed on Papers with Code. Its size, temporal coverage, and geographic scope are not detailed in the available metadata.
Kaggle hosts the BMP-VLM-2 dataset. The title suggests it contains data for training or evaluating vision-language models, which combine image and text understanding. Specific details regarding its size, creation date, and authorship are not provided in the available metadata.
MolParse v1.0 is a multimodal dataset released in January 2026 for optical chemical structure parsing. It contains a large-scale collection of molecular structure images sourced from scientific literature, designed to train models that convert diagrams into structured chemical representations.
A dataset from the LLaVA (Large Language and Vision Assistant) project, likely containing multimodal data for training or evaluating vision-language models. The dataset is hosted on Kaggle, but its contents, size, origin, collection method, and temporal coverage are not provided in the metadata.
Keysay VLM Context Training is a multimodal dataset for vision-language model development, curated by Enriqueag26. It contains image-text pairs, as indicated by its platform tags for image and text modalities, and was last updated in March 2026.
A dataset likely containing images paired with textual captions, inferred from the title 'blip_captions_data'. It is hosted on Kaggle, but detailed metadata such as size, source, and creation date is unavailable. The content and structure require verification after download.
SynVQA-UITAIC is a dataset hosted on Kaggle. The title suggests it is a benchmark for evaluating Visual Question Answering (VQA) systems, possibly containing synthetic or generated visual and textual content. Its specific contents, size, and authorship are unknown from the provided metadata.
VideoMind-SFT contains 481,000 video-annotation pairs and a 210,000-record Grounder subset released by yeliudev in early 2026. The collection provides videos in both original formats and compressed versions at 3 FPS and 480p resolution without audio for efficient model training.
A dataset titled 'test-images-vqa' is hosted on Kaggle. The dataset likely contains images paired with questions and answers for visual question answering tasks. Metadata such as size, columns, and license are currently unknown.
BLIP_test is a dataset hosted on Kaggle. Its title suggests it is related to the BLIP (Bootstrapping Language-Image Pre-training) model, a vision-language framework. The dataset's specific content, size, and structure are unknown from the provided metadata.
llava-annotations-pascal-voc is a dataset hosted on Kaggle. The title suggests it contains annotations generated by the LLaVA (Large Language and Vision Assistant) model for the PASCAL VOC object detection and segmentation benchmark. The dataset likely provides question-answer pairs or descriptive labels for images, linking visual content with language.
A collection of training and evaluation files derived from the MultiVENT 2.0 benchmark for text-to-video retrieval. The dataset provides structured query-video pairs within training_data.json designed to facilitate explicit reasoning over video content for relevance assessment.
PhoStream contains 5,572 open-ended QA pairs derived from 578 videos across 4 scenarios and 10 capabilities, released by lucky-lance in 2026. This benchmark evaluates omnimodal assistants in mobile-centric streaming environments, focusing on both on-screen and off-screen phone usage. It specifically tests a model's ability to determine both the timing and the content of responses while processing continuous audio-visual streams.
A collection of pre-rendered, categorized 3D multi-room environments for evaluating spatial reasoning in vision-language models. It provides structured visual scene data to support the Theory of Space benchmark, focusing on active exploration and the construction of spatial beliefs.
Kaggle hosts the MiniVLM dataset, which is likely related to vision-language modeling. The dataset's specific content, size, and creation details are not provided in the available metadata.