Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,539 datasets
cudabenchmarktest created this dataset for fine-tuning models derived from the Qwen3.5-9B architecture. The description includes a critical warning about a required inference flag to prevent high empty-answer rates when serving models via Ollama. The dataset was last updated on April 15, 2026.
DoseRAD2026 is a large-scale, multimodal dataset for radiotherapy research, created by LMUK-RADONC-PHYS-RES. It is designed to support the development and benchmarking of fast and accurate radiation dose calculation and prediction methods. The dataset was last updated on April 17, 2026.
A collection of biomedical image segmentation datasets packaged for the W-PVLMedSeg project. The datasets are organized into train, validation, and test splits with corresponding image and label folders. The repository was created by DanRuguo and last updated on 2026-04-27.
NuTonic/sat-bbox-metadata-sft-v1 is a metadata-first dataset built for training multimodal chat models. It likely contains Sentinel-2 satellite image chips paired with JSON metadata and optionally Mapbox stills. The dataset was created by NuTonic and last updated on April 28, 2026.
A multimodal design-to-code benchmark built from community Figma designs, integrating screenshots, structured metadata, and design assets. The dataset was created by xcodemind and last updated on April 29, 2026.
RoomTour3D provides video frames subsampled at 3 frames per second from YouTube room tour videos. The frames are downsampled so that the shorter side is 360 pixels. The dataset was created by 'roomtour3d' and last updated on April 23, 2026.
Minimind V Dataset is a multimodal collection for training vision-language models, assembled by jingyaogong from sources including Chinese-LLaVA-Vision, llava-en-zh-300k, and LLaVA-SFT-665K. It contains approximately 570,000 pre-training images and 965,000 instruction-following data points, with content in both English and Chinese. The dataset was last updated on Hugging Face on April 4, 2026.
VisionFoundry-10K is a synthetic visual question answering dataset containing 10,000 image-question-answer triples. The data was created by the VisionFoundry pipeline, which uses an LLM to generate task-aware content and a text-to-image model to synthesize images, with samples filtered by a multimodal verifier. It was authored by zlab-princeton and last updated on Hugging Face in April 2026.
A metadata-first, procedural VLM SFT dataset built from an existing 'sat-bbox' style dataset tree. The dataset, created by NuTonic, was last updated on 2026-04-30. It is designed to provide high-signal supervision for multimodal chat models, using Sentinel-2 satellite chips paired with JSON metadata and optional Mapbox stills.
Agentic-MME is an official benchmark dataset featured in Hugging Face Daily Papers. It is designed to evaluate multimodal agents in tool-use, web searching, and multi-step reasoning through visual clues. The dataset was created by Agentic-MME and last updated on April 11, 2026.
A synthetic dataset parameterized from published Sub-Saharan African literature, not real observations. It is a multimodal bone fracture classification dataset designed for African healthcare contexts, created by electricsheepafrica and last updated on April 14, 2026.
Zeng's corpus contains annotation files and coding templates for analyzing institutional Spanish tourism videos. The dataset includes ELAN annotation files and Excel coding templates, with materials available from the author upon request. It was last updated on April 6, 2026.
This reproducibility package contains processed data and outputs for a dynamic-t multimodal landmark survival prediction framework for multiple myeloma. It includes processed MMRF CoMMpass resources, external validation resources, and training/evaluation outputs, supporting reproducible model training, benchmarking, and visualization. The framework utilizes laboratory time-series, drug exposure, and imaging-derived features for survival modeling.
A study of female zebra finches tested for responses to audio and visual stimuli from mates or strangers. The dataset includes processed data for statistical analysis, plus raw coordinate data for beak tip, head, and back positions during pre-stimulus and playback periods. Data was collected by Sarah Woolley and harvested from Borealis Dataverse in April 2026.
13,999 frames sampled from 157 movies released between 1982 and 2023. The dataset is annotated with grounded scene graphs and 16 safety tags for evaluating vision-language models. It was created by fcakyon and last updated on April 27, 2026.
A 7.7 GB dataset integrating Perturb-seq and optical pooled screening to map cellular responses and enable cross-modal inference. The dataset is stored in H5AD format; it was authored by Romain Lopez and last updated on 2026-04-18.
A Bengali translation of the VQA v2.0 dataset created for research in Visual Question Generation. The dataset contains Bangla questions and answers aligned with images, along with the original English annotator answers. It was published by Tahsin-Mayeesha in 2023 as part of the work "Visual Question Generation in Bengali" presented at MM-NLG.
Lo6yu's Egocentric Multimodal Daily-Life RGB-D EMG IMU Dataset captures synchronized egocentric daily-life manipulation data. The dataset combines RGB-D video for scene context and hand motion with bilateral wrist EMG for muscle activation and wrist IMU signals. It was last updated on 2026-05-02 19:30:11.
Preference data tracks changes in female choice for an initially unpreferred male across learning and copying tests. The dataset includes binary indicators for preference increase and test phase identifiers. It was authored by Marina Hutchins and last updated in April 2026.
A technical report released on 2026-04-30 details an approach for Multimodal Large Language Models (MLLMs) to bridge the 'Perception Gap'. The dataset, uploaded by NodeLinker, is intended to include in-house benchmarks and a subset of cold-start data for future public release, with model weights planned for integration into a foundation model.