Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,539 datasets
An estimated 75 to 100 million people speak Nigerian Pidgin English, yet no production-ready, commercially licensed dataset existed for it. The Nigerian Pidgin Voice + Text Dataset is a multimodal collection built by WAZOBIALABS to fill critical gaps that cause voice AI to fail for Nigerian users. It was last updated on Hugging Face in April 2026.
Korean Visual Document Retrieval Hard Negatives is a multimodal training set for fine-tuning embedding models. The dataset, created by whybe-choi, was last updated on 2026-04-25. Each row contains a text query, a page image document, one positive match, and seven mined hard negatives.
Vietnamese Visual Question Answering dataset focused on medicinal plants and herbs. It was developed for scientific research at Ton Duc Thang University (TDTU) to advance AI models for herb recognition and question answering. The dataset was last updated on Hugging Face in April 2026.
A11y-CUA is a multimodal dataset containing real computer-use task trajectories from sighted users, blind and low vision users, and AI agents. The dataset includes structured interaction logs, metadata, screen video, and system audio for each task. It was created by berkeley-hci and was last updated on Hugging Face in April 2026.
Evaluation outputs for studying metric inconsistency in multimodal machine unlearning, supporting reproducibility for a NeurIPS 2026 paper. The dataset contains results on VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench) and CIFAR-10 baseline results. It was created by author 'neurips26' and last updated on 2026-05-01.
The Strawberry Disease Multimodal Dataset by Qin2006 contains strawberry image data paired with environmental parameters and variety information. It is designed for studying correlations between environmental factors and disease occurrence, as well as multimodal fusion detection algorithms. The dataset was last updated on Hugging Face on 2026-04-19.
A multimodal dataset for strawberry disease detection contains strawberry image data, corresponding environmental parameters (air temperature, air humidity, soil moisture) and strawberry variety information. It can be used to study the correlation between environmental factors and strawberry disease occurrence, as well as multimodal fusion disease detection algorithms. The dataset was authored by Qin2006 and last updated on 2026-04-19.
An overview of the MedMNIST+ 2D benchmark datasets used to evaluate the BioFuse embedding fusion framework. The dataset is 9.5 KB in size, authored by Mirza Nasir Hossain, and was last updated on March 18, 2026. It was used to test a framework that fuses embeddings from 9 state-of-the-art foundation models to achieve high performance on biomedical image classification tasks.
En Vdr Hn is a multimodal retrieval training set for fine-tuning visual-document embedding models on English document pages. The dataset, created by whybe-choi and last updated on 2026-04-26, pairs query text with page images, and each row contains one positive and seven mined hard negatives. Hard negatives were mined using the Qwen/Qwen3-VL-Embedding-8B model within each source dataset.
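As a rough illustration of what "mined hard negatives" means in the two visual-document retrieval entries above, the sketch below ranks candidate pages by similarity to a query and keeps the top non-positive ones. It is only a toy under stated assumptions: the two-dimensional vectors, the cosine scoring, and the names `cosine`, `mine_hard_negatives`, and `pages` are hypothetical and are not taken from the dataset cards, which state only that Qwen/Qwen3-VL-Embedding-8B was used for mining.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mine_hard_negatives(query_vec, page_vecs, positive_id, k=7):
    """Return the k pages most similar to the query, excluding the positive.

    Assumption: similarity is cosine over precomputed embeddings; the real
    mining procedure behind the datasets is not described in the listing.
    """
    scored = [
        (cosine(query_vec, vec), page_id)
        for page_id, vec in page_vecs.items()
        if page_id != positive_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [page_id for _, page_id in scored[:k]]

# Toy embeddings standing in for real page-image embeddings.
pages = {
    "p0": [1.0, 0.0],  # the positive page for this query
    "p1": [0.9, 0.1],
    "p2": [0.0, 1.0],
    "p3": [0.7, 0.7],
}
hard = mine_hard_negatives([1.0, 0.0], pages, positive_id="p0", k=2)
print(hard)
```

With `k=7`, as in the dataset rows, the same function would return seven candidates per query; the pages that look most like the positive are the ones that make the fine-tuning signal harder.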
DocVQA Media Labeled Clean is a dataset hosted on Hugging Face by author merve. The dataset was last updated on June 5, 2026. Its specific content and scale are unknown from the provided metadata.
MERRIN is a human-annotated benchmark for evaluating search-augmented agents on multi-hop reasoning over noisy, multimodal web sources. It measures agents' ability to identify relevant modalities, retrieve evidence from the open web, and reason over conflicting sources spanning text, images, video, and audio. The dataset was created by HanNight and was last updated on 2026-04-16.
A figshare document by Yuting Yi, last updated in March 2026, details a single clinical case of a 54-year-old woman with a rare cardiac condition. The 19.6 KB file includes a case report and a systematic literature review identifying 13 reported cases of apical hypertrophic cardiomyopathy (ApHCM) with calcification. The document integrates clinical presentation, multimodality imaging, histopathology, and genetic analysis findings.
TAB-VLM is a benchmark dataset containing 600 examples designed to measure cultural anachronism in Vision-Language Models. It was created by authors Mukul Ranjan, Prince Jha, Khushboo Kumari, and Zhiqiang Shen, with a paper accepted for ACL 2026 Findings. The dataset assesses the tendency of models to misinterpret historical artifacts using temporally inappropriate concepts.
InternData-A1 contains over 630,000 trajectories and 7,433 hours of robotic manipulation data across 4 embodiments and 227 scenes. Created by InternRobotics and documented in arXiv:2511.16651, it provides a hybrid synthetic-real collection covering 18 skills and 70 tasks.
ChartNet-Bench is a benchmark dataset containing 3,807 chart images for evaluating faithful multimodal chart understanding. It includes 2,000 synthetic charts and 1,807 real-world charts, all human-verified. The benchmark supports tasks like chart-to-CSV extraction, summarization, and hallucination detection.
PianoVAM v1.1 is a multimodal dataset containing piano performances, including video, audio, and MIDI data. The dataset was initially released for ISMIR 2025 and is maintained by the PianoVAM organization. The current version includes corrections for video-MIDI synchronization issues.
Bordair Multimodal Prompt Injection Dataset contains 62,063 labeled samples for training and evaluating prompt injection detectors. The dataset, created by Bordair and last updated in April 2026, includes 38,304 attack and 23,759 benign samples covering cross-modal, multi-turn, and evasion attack types. All samples are source-attributed to peer-reviewed papers or documented industry research and are labeled with an expected_detection flag.
PolyCap is a dataset of image-grounded captions for the MindSemantix project. The dataset, created by author ziqiren, was last updated on Hugging Face on 2026-05-12. It contains caption files for subjects sub01, sub02, sub05, and sub07; the corresponding COCO captions are not included and must be obtained from the separate NSD dataset.
Aishwarya Iyer published multimodal characterization data for the cardiac muscle protein sMyBP-C M-domain on figshare in May 2026. The dataset likely contains structural, functional, and dynamic measurements of the protein. It specifically examines the impact of a novel pathogenic mutation.
WithinUsAI's 'Geminipro3.2 Max Distill God Seed 25K' is a dataset of 25,000 examples engineered for distilling large language models. The dataset aims to imbue base models with capabilities described as deep scientific reasoning, long-context understanding, and thoughtful calibration. It was last updated on the Hugging Face platform on April 23, 2026.