Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
A dataset titled 'baseline-vqa-cv' is hosted on Kaggle. The dataset likely contains image-text pairs for visual question answering tasks, a common benchmark in computer vision and AI. Its specific content, scale, and authorship require verification after download.
llava_annotations_pascal_voc is a dataset hosted on Kaggle. The title suggests it contains annotations, likely for images from the PASCAL VOC dataset, generated or used by the LLaVA (Large Language-and-Vision Assistant) model. The dataset's specific content, size, and creation details are not provided in the available metadata.
Humanity's Last Exam (HLE) is a multi-modal benchmark containing 2,500 questions across dozens of academic subjects, released by the Center for AI Safety and Scale AI in January 2025. It serves as a frontier-level evaluation suite designed to test AI models at the limits of human knowledge through closed-ended questions.
A dataset for visual question answering (VQA) tasks in the medical domain, specifically focused on radiology images. It was published on the Kaggle platform, but detailed information about its size, creation date, and authors is not provided in the available metadata. The dataset likely contains pairs of medical images and associated textual questions and answers.
A dataset titled 'med_vqa' hosted on Kaggle. The title suggests it contains medical visual question-answering data, likely pairing medical images with related questions and answers. The dataset's specific scale, origin, and creation date are unknown from the provided metadata.
Aligned-8-Emotion-Dataset-Final is a multimodal dataset containing 88,360 face images and text in both English and Amharic, annotated for 8 emotion categories. The dataset is hosted on Kaggle, but specific authorship, collection methodology, and temporal details are not provided. Its primary purpose is likely training and evaluating emotion recognition models across modalities and languages.
A dataset titled 'kvasir-vqa-dataset-images' published on Kaggle. The name suggests it likely contains medical images paired with questions and answers for visual question answering tasks. The dataset's author, organization, size, and specific content are unknown.
A text dataset focused on Nigerian linguistic alignment, published on Kaggle. Its description suggests it contains creative stories, likely in Nigerian languages or dialects. The author, organization, and specific data characteristics are not provided in the metadata.
Amazon Multimodal Product Classification Dataset is hosted on Kaggle. The dataset title suggests it contains product information from Amazon, likely combining text and image data for classification tasks. Specific details on size, creation date, and authorship are not provided in the available metadata.
Annotations likely linking images to text, created for the LLaVA (Large Language-and-Vision Assistant) project. The dataset is hosted on Kaggle, but its specific size, structure, and creation details are not provided in the available metadata. The content appears to be derived from or related to the MS COCO (Common Objects in Context) image dataset.
Kaggle hosts the blip-itm-v3-checkpoint-v3, a model checkpoint for the BLIP (Bootstrapping Language-Image Pre-training) architecture. The checkpoint likely contains parameters for image-text matching tasks, enabling vision-language model fine-tuning. Its specific training data, size, and performance metrics are not detailed in the provided metadata.
Solar-Icicles-Multimodal-V1 is a dataset described as 'The Night Crew Benchmark: A Comparative Study of 7 SOTA Video Architectures'. It is hosted on Kaggle. The dataset's author, organization, size, and specific contents are not detailed in the provided metadata.
A multimodal dataset containing 88,360 face images and text in English and Amharic, annotated for 8 emotion categories. It is hosted on Kaggle and intended for sentiment and emotion analysis tasks. The author, organization, and specific collection details are not provided.
KORE-74K is a multimodal dataset containing over 74,000 training entries for image recognition, captioning, and visual question answering tasks. It was created by author kailinjiang and published in 2026, building upon the MMEVOKE dataset. The data includes separate archives for recognition/caption images and VQA images, paired with structured JSON annotations.
ChartVerse-RL-40K is a curated dataset of the most challenging chart reasoning samples for Reinforcement Learning, developed by opendatalab. It contains samples with the highest failure rates, which strong Vision-Language Models struggle with but can still solve occasionally, providing a strong learning signal for RL training. The dataset was last updated on 2026-01-21.
ynyg's Unified-Prompt-Guard dataset, last updated January 2026, is a text dataset for training binary classifiers to defend against LLM jailbreak attacks and unsafe prompts. It contains 265,589 training, 10,857 validation, and 10,857 test samples, synthesized from three high-quality sources: jailbreak-detection-dataset, Nemotron-Safety-Guard-Dataset-v3 (zh), and PKU-SafeRLHF.
A dataset titled 'WAVLM_Age(VF)' hosted on Kaggle. The title suggests it contains voice features likely extracted using the WAVLM model for the purpose of age prediction or analysis. No further metadata, such as sample count, file formats, or author details, is provided.
RadioML Optimized Multimodal Dataset is a processed version of the RadioML dataset, stored in Zarr format. The dataset appears to be optimized for machine learning workflows and includes multimodal features. The original author, organization, and specific data volume are not provided in the available metadata.
VinDr-CXR-VQA is a large-scale dataset combining 4,394 chest X-ray images with 17,597 natural language question-answer pairs. The dataset, created by faizan711 and last updated in January 2026, is designed for explainable medical AI and includes spatial grounding annotations and clinical reasoning explanations. It features six distinct question types to facilitate research in medical visual question answering.
vifoodvqa is a dataset published on Kaggle. The title suggests it is a Visual Question Answering (VQA) dataset focused on food images. The dataset's specific content, size, and origin require verification after download due to minimal provided metadata.