Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,561 datasets
This collection aggregates multiple multimodal datasets and pre-computed visual features specifically curated for Visual Question Answering (VQA) and image captioning tasks. It provides a standardized interface for PyTorch users to access vision-language benchmarks through a dedicated Python package.
Over 10,000 artistic images from the WikiArt repository have been paired with descriptive captions generated by the BLIP model. This multimodal dataset was created by ChristophSchuhmann and uploaded to Hugging Face in May 2022. It combines visual art with machine-generated text descriptions.
A visual dataset of emoticons annotated using the image parsing capabilities of the glm-4v and step-1v multimodal AI models. The dataset was created by LLM-Red-Team and was last updated on April 27, 2024. The specific number of images, rows, and columns is unknown.
DocVQA Train is a dataset for visual question answering on document images. It was uploaded by Raagul04 to Hugging Face in July 2022. The dataset is intended for training models to answer questions based on visual content within documents.
Multimodal Sarcasm Detection is a dataset for detecting sarcasm across multiple data modalities, likely combining text and visual information. The dataset was created by Carol99 and was last updated on Hugging Face in April 2022. The number of samples, the features, and the collection method are not provided in the available metadata.
Image Captions is a multimodal dataset hosted on Hugging Face by csarron, last updated in November 2021. It pairs images with descriptive text for tasks that link visual and language understanding. The number of image-text pairs and the source of the images are not detailed in the available metadata.
Image-text pairs for Italian Contrastive Language–Image Pre-training (CLIP). This data aligns visual representations with Italian linguistic descriptions to support cross-modal retrieval and zero-shot classification.
This repository provides a PyTorch implementation of deep-learning-based cross-modal hashing. It was authored by WangGodder and last updated in October 2021. Dataset details, including row and column counts, are unknown.
This is version 1.0 of the ADVQA dataset, authored by HuggingFaceM4 and last updated in June 2022. The dataset's row count, column structure, and specific content are unknown.
NUSTM developed this multimodal dataset for explainable depression recognition in clinical interviews, with the most recent update occurring in January 2025. It provides data for affective computing research, specifically focusing on the intersection of mental health and machine learning interpretability.
1,000 3D object models featuring synchronized visual, acoustic, and tactile data. The collection includes 3D meshes, simulated impact sounds, and high-resolution tactile images generated via the DIGIT sensor simulation.
A curated collection of multimodal Prognostics and Health Management (PHM) resources, organized into fault diagnosis and fault prediction categories. The content addresses the integration of diverse sensor data for industrial equipment health monitoring.
10,921 high-resolution remote sensing images collected from satellite imagery sources, each paired with 5 descriptive natural language captions. The dataset covers 30 distinct scene categories, including airports, bridges, and residential areas, for a total of 54,605 caption-image pairs (10,921 × 5).
DocVQA is a dataset for document visual question answering, created by nlpconnect and hosted on Hugging Face. The dataset was last updated in May 2022, though specific details on its size and composition are not provided in the available metadata.