Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,541 datasets
BrowseComp-V3 is a benchmark dataset containing 300 samples for evaluating multimodal browsing agents. It includes encrypted question-answer pairs, images, search trajectories, and sub-goals. The dataset was created by Halcyon-Zhang and last updated on February 13, —.
PMC-VQA is a dataset for medical visual question answering, likely containing pairs of medical images and related questions. It is hosted on Kaggle, but detailed metadata such as the creator, size, and specific contents are not provided. Its purpose appears to be training and evaluating AI models on medical image-text understanding tasks.
This dataset contains 8,361 curated triplets, each consisting of a prompt, a response, and a safe response, across various risk categories. It includes safety scores, judge reasoning, and harm probability assessments. It was created by Gretel.ai and is available under the Apache License 2.0.
D.Html contains fewer than 1,000 document page images paired with structured HTML and Markdown markup for OCR and reconstruction tasks. Developed by prithivMLmods and updated in March 2026, the collection focuses on preserving document hierarchies like headings and paragraphs.
MM-GHIM-10K is a multimodal dataset containing paired image and text data, intended for Content-Based Image Retrieval (CBIR) research. The dataset is published on Kaggle, but its specific size, creation date, and authorship are not detailed in the provided metadata. Its content likely consists of 10,000 items, as suggested by the '10K' in its title, though this requires verification.
A multimodal dataset containing images and associated text, likely for Content-Based Image Retrieval (CBIR) research. It is hosted on Kaggle, but specific details like size, author, and update date are not provided in the available metadata. The dataset's content and structure require verification after download.
Libero90 VLM Features is a dataset uploaded to HuggingFace by user 'arif101'. The dataset's title suggests it contains extracted features for vision-language model tasks, likely related to the LIBERO benchmark. The dataset was last updated on April 12, 2026.
Multimodal Crypto Features v5 is a dataset hosted on Kaggle. Its title suggests it contains multiple types of data features related to cryptocurrencies. The specific content, scale, and origin are not detailed in the available metadata.
EVP Multimodal at Microsoft and SpaceX is a dataset hosted on Kaggle. The dataset's title suggests it contains multimodal data, likely combining image, text, or other data types, from the two named organizations. Specific details on content, size, and collection methods are unavailable from the provided metadata.
MultimodalLLM-Moroccan-SLT is a dataset hosted on Kaggle. The title suggests it likely contains data for Moroccan Sign Language Translation, potentially involving multiple modalities such as video or images paired with text. The dataset's specific content, size, and authorship are unknown and require verification after download.
Zebra-CoT is a large-scale dataset containing 182,384 samples of logically coherent interleaved text and image reasoning traces. It was created by multimodal-reasoning-lab and covers four major categories: scientific reasoning, 2D visual reasoning, 3D visual reasoning, and visual logic & strategic games. The dataset was last updated on Hugging Face in January 2026.
VisRes Bench contains 10,000 to 100,000 image-text pairs designed to evaluate the visual reasoning of Vision-Language Models (VLMs) in naturalistic settings. Developed by researchers at TII (tiiuae) and updated in March 2026, it isolates visual logic by removing contextual language supervision.
Over 14.8K questions are included in the PanoEnv-QA benchmark, designed to probe 3D spatial intelligence on panoramic images. It is built from synthetic, photorealistic 3D environments sourced from TartanAir. The dataset was created by author 7zkk and was last updated on February 24, 2026.
MicroLens VQA provides 93,014 triples of microscopy images paired with questions and answers for fine-tuning vision-language models. The dataset appears to be sourced from Kaggle, but its author, organization, and specific collection methodology are unknown. Its last update date and geographic scope are also unspecified.
DataConcept-128M contains 128 million web-crawled image-text pairs annotated with fine-grained concept composition details. It is derived from DataComp-CLIP and designed to enable Concept-Aware Batch Sampling for multimodal pretraining.
A dataset titled 'Vqagent Pairwise Preference' was published on the Hugging Face platform by the user 'qgfvadfuvads'. The title suggests it contains pairwise preference comparisons, likely used for training or evaluating reinforcement learning agents. The dataset was last updated on April 12, 2026.
Quran-MD is a multimodal dataset of the Qur'an integrating textual, linguistic, and audio dimensions at the verse and word levels. The dataset was created by 'yourmumisacow' and is associated with a paper accepted at the 5th Muslims in ML Workshop co-located with NeurIPS 2025. The specific ayah-level subset was last updated on February 21, 2026.
A dataset likely associated with the LLaVA (Large Language-and-Vision Assistant) project for training multimodal AI models. It was published on Kaggle, but its specific contents, size, and creation details are not provided in the metadata. The dataset name suggests it is designed for instruction-following tasks involving both visual and textual data.
A dataset for Vision-Language Models (VLMs) focused on the Blind 3D (B3D) task. The dataset was created by VietMedTeam, with main authors Nguyen Kim Hai Bui and An Ngo Xuan, and was last updated on April 1, 2026.
OmniBrainBench is a multimodal benchmark dataset for brain imaging analysis across multi-stage clinical tasks. The dataset was created by FrankPN and is associated with a CVPR 2026 paper. Specific details on row count, column count, and data size are not provided in the available metadata.