Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
A Vietnamese-language Visual Question Answering (VQA) dataset covering 90 animal species. The collection pairs images of animals with Vietnamese questions and answers to support multimodal learning.
A large-scale collection of mathematical problems categorized for Multi-Model TIR tasks. It provides structured data for training and evaluating reasoning-based mathematical solvers through multi-step logic.
53,202 instruction-tuning examples covering over 200 specialized cybersecurity domains, including cloud-native threats and AI/ML security. Created by the Trendyol Security Team for training defensive security AI assistants, this dataset was expanded from an initial 21,000 rows. The dataset was last updated on December 16, 2025.
A sample subset of data for Visual Question Answering (VQA), a multimodal AI task. The dataset is hosted on Kaggle, but its specific size, origin, and update history are not detailed in the provided metadata. Content likely pairs images with corresponding questions and answer annotations.
753,715 medical image-text pairs totaling 49.37 GB, designed for fine-tuning models like LLaVA-Med++. The dataset, created by Kafoo and last updated in November 2025, is stored in JSONL format alongside its images. Its captions are notably concise, averaging 1.0 words in length.
A dataset titled 'multimodal-vlm2' hosted on Kaggle. The title suggests it contains data for training or evaluating Vision-Language Models, which typically integrate visual and textual information. The dataset's specific content, size, and origin are not detailed in the provided metadata.
A dataset from Kaggle, likely containing paired image and text data for training or evaluating vision-language models. The specific content, scale, and creation details are not provided in the available metadata.
i-CIR is a benchmark for instance-level composed image retrieval containing between 100,000 and 1,000,000 records, released by billpsomas in 2024. It facilitates the retrieval of specific, visually indistinguishable objects by combining a reference image with a text-based modification query. The dataset includes a specialized database of visual, textual, and compositional hard negatives to test model precision.
A processed and enhanced version of the H&M Personalized Fashion Recommendations Kaggle competition dataset. The dataset has been cleaned and augmented with pre-computed embeddings and accessible image URLs by Qdrant, last updated in December 2025.
A training set of 127,460 query-image pairs for visual document retrieval, released by vidore in 2024. It combines 63% academic data from sources like DocVQA with 37% synthetic PDF pages augmented with pseudo-questions generated by Claude-3 Sonnet.
Simuletic's Surveillance VLM Weapon Knife Detection Dataset is an open-source subset of the Simuletic Safety VLM Dataset. It is designed for instruction tuning of Vision Language Models to locate weapons and knives, reason about threats, and avoid false positives. The dataset was last updated on December 17, 2025.
TaiwanVQA is a visual question answering benchmark containing 2,736 original images paired with 5,472 manually designed questions. It is designed to evaluate the capability of vision-language models in recognizing and reasoning about culturally specific content related to Taiwan. The dataset was created by author hhhuang and last updated on December 4, 2025.
Seamless Interaction contains over 4,000 hours of multimodal face-to-face interaction footage featuring more than 4,000 participants. Released by Meta in 2025, this collection captures the complex interplay of verbal and nonverbal signals during human communication for AI research.
Over 400,000 human preference responses for evaluating the Flux 2 Pro text-to-image model, collected in less than seven hours via the Rapidata Python API. The dataset was created by Rapidata and last updated on December 2, 2025. It includes evaluations across preference, coherence, and alignment categories.
Multimodal Physiological Stress Dataset is a collection of dynamic stress data from college students, published on Kaggle. The dataset likely contains time-series physiological measurements, though specific columns and sample sizes are not detailed in the provided metadata. Its raw description indicates a focus on student stress levels, but the exact collection methodology and temporal coverage are unknown.
Vision-Language Models are Confused Tourists evaluates the cultural robustness of VLMs, a largely untested dimension crucial for supporting diverse societies. The dataset was created by author patrickamadeus and was last updated in December 2025. It contains image-text pairs designed to test model stability across diverse cultural inputs.
Chart VQA likely contains images of charts and graphs paired with natural language questions and answers. The dataset is hosted on Kaggle, a platform for data science competitions and projects. Specific details on volume, creation date, and authorship are not provided in the available metadata.
Image caption data likely contains pairs of images and descriptive text. The dataset is hosted on Kaggle, a platform for data science competitions and projects. Specific details on volume, creation method, and update recency are not provided in the metadata.
A dataset from Kaggle focusing on college students' career preferences. The raw description suggests it includes psychological and IoT behavioral indicators. The specific scale, collection method, and temporal coverage are not detailed in the provided metadata.
Viet-Chart-VQA-images is a dataset hosted on Kaggle. The title suggests it contains images paired with questions and answers, likely for training or evaluating Visual Question Answering models. The dataset's content, scale, and provenance require verification after download.