Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,560 datasets
A collection of multimodal image-text pairs, with each sample including an image or image URL, associated text strings, a source identifier, and JSON-formatted metadata. The dataset was created by HuggingFaceM4 and was last updated in June 2022.
A modified version of the Amazon Multimodal Product dataset, slimmed for training multimodal LLMs. The dataset includes product descriptions generated using the Gemini Flash model. It was created by philschmid and last updated in September 2024.
Two document images from the DocVQA dataset that serve as fixtures for the Hugging Face Transformers library. These samples are used to test LayoutLMv2FeatureExtractor and LayoutLMv2Processor in specific unit test files.
The AMTTL dataset is a monolingual Chinese text collection for token classification tasks, created by author gavinxing and last updated in January 2024. It is categorized as containing between 1,000 and 10,000 instances (1K<n<10K) and has crowdsourced annotations.
RedCaps is a dataset of 12 million image-text pairs collected from Reddit. It is designed for image-to-text tasks and was created by Karan Desai and colleagues.
A collection of multi-turn guessing games utilizing VisualGenome images and scene graphs for attribute grounding tasks. The data serves as a multi-task framework to evaluate the quality of neural representations through object identification and visual dialogue.
More than 10,000 photographs of buildings and unique architecture, sourced from Pexels in 2023. The dataset creator, lodestones, used the CogVLM model to generate a descriptive caption for each image. The dataset was last updated on the Hugging Face platform in June 2024.
This dataset is derived from the original Anthropic/hh-rlhf collection and has been formatted using the OpenAI chat convention for Direct Preference Optimization (DPO) fine-tuning. Each conversational response has been labeled for safety using the Llama Guard model. The dataset was uploaded by author javirandor and last updated on March 28, 2025.
Zaynoid published a dataset titled 'Vlm Train 1K' on the Hugging Face platform on 2025-12-14. The title suggests it is likely a collection of 1,000 items for training vision-language models. The specific content, format, and structure require verification after download.
12 million image-text pairs sourced from 350 manually curated subreddits covering diverse objects and scenes. The dataset utilizes subreddit names as coarse labels to guide composition without requiring manual per-instance annotation.
This dataset supports the FETA research paper, which was published as a main conference paper at NeurIPS 2022. It is used for specializing foundation models for expert task applications, with the official resources available on a dedicated GitHub repository.
Face Synthetics Spiga Captioned is a copy of the Microsoft FaceSynthetics dataset enhanced with SPIGA-calculated facial landmark annotations and BLIP-generated text captions. The dataset, created by multimodalart and last updated in March 2023, is designed for multimodal tasks involving synthetic facial imagery.
Conceptual 12M contains 12 million image-text pairs intended for vision-and-language pre-training. It was created by Google Research using a relaxed version of the data collection pipeline from Conceptual Captions 3M.
The Tumblr GIF (TGIF) dataset contains 100,000 animated GIFs and 120,000 descriptive sentences. GIFs were collected from randomly selected Tumblr posts published between May and June 2015, with sentences gathered via a crowdsourced annotation interface. It is designed for evaluating animated GIF and video description techniques.
HowTo100M contains 136 million narrated video clips sourced from 1.2 million YouTube instructional videos, amounting to roughly 15 years of total footage. The dataset focuses on videos where creators teach complex tasks, covering 23,000 activities in domains like cooking, crafting, and fitness.
Mind2Web Live provides approximately 1,000 records for web navigation and interaction tasks, released by iMeanAI in October 2024. The dataset focuses on text-based web environments and is formatted for integration with modern data libraries like Polars and Dask.
Evaluation benchmarks for the Video-R1 model across video reasoning categories, including test sets for temporal and causal logic. The dataset provides the data required to replicate the reasoning performance results presented in the 'Video-R1: Reinforcing Video Reasoning in MLLMs' research paper. It is designed to test the logical and temporal inference capabilities of Multimodal Large Language Models.
Math-PUMA created this dataset to enhance mathematical reasoning through progressive upward multimodal alignment, as described in a 2024 arXiv preprint. The dataset contains English text focused on mathematics and reasoning tasks. Specific details on size, rows, and columns are not provided.
Designed for text classification tasks, specifically sentiment classification, with a size category of 1K to 10K instances. It contains monolingual Russian text data, created by the author Aniemore.
Published on Hugging Face by author zaiquan and last updated on 2025-12-04. The dataset likely contains multimodal data for spatio-temporal video grounding tasks, which involve linking language queries to specific objects and time segments in videos. Its specific content, scale, and collection methodology require verification after download.