Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
Annotations for a Visual Question Answering (VQA) dataset focused on animals. Based on its description, the dataset likely contains image-question-answer triplets. It is published on Kaggle, but the number of samples, collection method, and original authors are not provided in the available metadata.
A dataset likely for Visual Question Answering (VQA) tasks focused on food items. The dataset is hosted on Kaggle, but detailed metadata such as column descriptions, sample data, and size are unavailable. Its content and structure require verification after download.
Multimodal Stroke Data is a dataset hosted on Kaggle. The dataset likely contains information related to stroke diagnosis, treatment, or outcomes. Specific details regarding its size, origin, and creation date are not provided in the available metadata.
A multimodal dataset likely containing images and text related to traditional herbal medicine from the Nusantara region. The dataset appears designed for Visual Question Answering (VQA) tasks, where models must answer questions about visual content. It is hosted on Kaggle, but detailed metadata such as size, author, and license are currently unknown.
SurveillanceVQA-589K is a large-scale dataset for visual question answering tasks, likely derived from surveillance footage. The dataset is hosted on Kaggle and appears to be a testing subset of a larger collection. Its specific content, such as the number of video clips or question-answer pairs, requires verification after download.
VQA-Rank8 is a dataset hosted on Kaggle. Its title suggests a connection to ranking tasks in visual question answering. No further metadata is available to confirm its specific content, size, or origin.
Multimodal_CSI_Text is a dataset published on Kaggle. The title suggests it contains Channel State Information (CSI) data, a type of wireless signal measurement, paired with text annotations. The dataset's specific content, scale, and collection details are not provided in the available metadata.
Quran-MD integrates textual, linguistic, and audio dimensions at the verse (ayah) and word levels. The dataset was created by Buraaq and the associated paper was accepted at the 5th Muslims in ML Workshop at NeurIPS 2025. It is part of a larger, complete Quran-MD collection.
CiteVQA is a dataset published on Kaggle. Its title suggests a focus on visual question answering tasks that require grounding answers in citations or references. The dataset's specific content, size, and origin require verification after download due to minimal provided metadata.
DiverseVQA is a dataset likely designed for visual question answering tasks, which involve answering natural language questions about images. It is hosted on the Kaggle platform, but detailed metadata such as the number of samples, specific image sources, and creation date are not provided. The dataset's content and scale require verification after download.
FRIEDA consists of 500 multimodal examples for open-ended cartographic reasoning, developed by knowledge-computing and released in late 2025. The benchmark pairs real-world map images with natural-language questions and reference answers to evaluate spatial reasoning capabilities.
A dataset for Visual Question Answering tasks, published on Kaggle. The dataset likely contains paired images and text questions with corresponding answers. Specific details on size, author, and last update are unknown.
A dataset named 'vqa-cv-rank8-32' published on Kaggle. Its title suggests a connection to Visual Question Answering (VQA) and ranking tasks; it likely contains image-text pairs with ranking labels. The dataset's author, organization, size, and specific contents are unknown.
PathVQA-Turkish-Text is a dataset published on Kaggle. The title and platform tags suggest it contains Turkish-language text associated with medical imagery for visual question answering tasks. The dataset's specific content, size, and provenance require verification after download.
m-Just's dataset comprises collages with a randomly placed 'core' image and a corresponding question-answer pair. This data was used to train the vSearcher model introduced in the research paper 'InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search'. The dataset was last updated on Hugging Face in January 2026.
OpenThoughts-Agent-v1-RL provides approximately 720 curated reinforcement learning tasks designed for training agentic models, released by the open-thoughts project in January 2026. The collection includes instructions, environment configurations, and verifiers specifically optimized for benchmarks like Terminal-Bench 2.0 and SWE-Bench.
InspecSafe-V1 is a high-quality, multimodal annotated dataset for world model construction in industrial environments. The data was collected from real-world inspection robots deployed across industrial sites and has been cleaned and standardized. The dataset covers five representative industrial settings, including tunnels and power facilities.
Raw mass spectrometry (MS) and tandem mass spectrometry (MSMS) spectra used for ex vivo ovarian cancer typing and immunoscoring. Developed by Léa Ledoux and hosted on Harvard Dataverse, the data supports surgical decision-making through multimodal machine learning. The collection was last updated in March 2026.
A dataset named SignVLM1, published on Kaggle. Its title suggests a connection to sign language and vision-language models. Metadata is minimal, and the actual content requires verification after download.
Structured3D is a dataset of panoramic indoor scene images paired with text captions generated by the BLIP3 model. The dataset was created by KevinHuang and was last updated on February 5, 2026. The description notes missing caption files for several specific scene paths, indicating potential data completeness issues.