Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,561 datasets
1,500,000 images representing Wikipedia entities curated for the Visual Question Answering over Entities (ViQuAE) benchmark. These images serve as a visual knowledge base for tasks requiring models to link visual inputs to external structured information and natural language questions.
A filtered dataset likely containing visual question-answering data derived from Stack Overflow content. It was published on the Hugging Face platform by the author mirzaei2114 and was last updated on December 2, 2023. The specific content, scale, and filtering criteria are not detailed in the available metadata.
3,700 image-grounded question-answer pairs and a retrieval corpus of 1.5 million Wikipedia passages. The dataset focuses on entity-centric visual question answering, requiring models to identify visual entities and retrieve external knowledge to provide answers.
A curated collection of human preference datasets across three categories: fine-tuning, RLHF, and evaluation. This repository indexes resources specifically designed for training and benchmarking Large Language Models against human-labeled preferences.
MMC is a multimodal instruction-tuning dataset for chart understanding published by Fuxiao Liu for NAACL 2024. It provides visual chart data paired with natural language instructions to improve the reasoning capabilities of large language models across various chart types, including stock market visualizations.
A dataset likely containing human preference data for Reinforcement Learning from Human Feedback (RLHF) applications. It was published by author liyucheng on the Hugging Face platform on April 15, 2023. The dataset's title suggests a connection to the Chinese Q&A platform Zhihu and a scale of approximately 3,000 entries.
BenchLMM evaluates the cross-style visual reasoning capabilities of Large Multimodal Models (LMMs) across diverse image styles. Developed by AIFEG and presented at ECCV 2024, this benchmark assesses how models generalize beyond standard natural images to various artistic and synthetic domains.
This audio-text dataset provides paired audio signals and descriptive captions for the first Audiocaption task, released by RicherMans in 2024. It serves as a benchmark for automated audio description systems and includes baseline code for performance evaluation.
LanguageBind published a dataset titled 'Video Llava' on the Hugging Face platform in January 2024. The dataset likely contains video and text data for training or evaluating multimodal AI models. Specific details on size, format, and content are not provided in the available metadata.
Llavacot Think is a multimodal dataset containing image-text pairs, categorized as having between 10,000 and 100,000 samples. Created by ahmedheakl, it was last updated in March 2025.
4,241 multimodal science questions representing the test split of the ScienceQA benchmark. It contains image-based multiple-choice questions accompanied by hints, lectures, and step-by-step explanations across natural, social, and language science subjects.
10,000+ hyper-detailed image descriptions and object-level annotations derived from the Open Images dataset. The data includes fine-grained attributes, spatial relationships, and dense scene narratives designed to improve vision-language model alignment.
25,000 multimodal examples likely containing images paired with text instructions and chain-of-thought reasoning. The dataset was created by author 'tomkld' and last updated on Hugging Face on December 10, 2024. Its columns suggest it contains image and text data for training vision-language models.
A collection of captioned cartoons combining image and text modalities. It was authored by juliaturc and last updated on November 8, 2022. The dataset is tagged with a US region focus and includes Parquet files.
A dataset from Anthropic, published on Hugging Face by user 'nz' and last updated on February 2, 2024. The title suggests it contains data for Reinforcement Learning from Human Feedback (RLHF), a technique for aligning language models. The specific content, scale, and structure require verification after download.
Mp DocVQA is a multimodal dataset for document visual question answering, created by lmms-lab. It contains image-text pairs in which questions are posed about document images. The dataset was last updated on Hugging Face in February 2024.
464 multimodal earnings conference calls from S&P 500 companies featuring sentence-level alignment between audio recordings and text transcripts. The dataset provides structured financial disclosures paired with stock volatility labels for modeling market risk responses.
Parsa-ra developed this multi-modal dataset interface on GitHub, with the last update recorded in April 2024. It serves as a practice project for unified data handling, though record counts and file formats are not documented in the repository metadata.
A Hindi-language dataset for visual question answering tasks, published on Hugging Face by author 'azharumo'. The dataset was last updated on September 17, 2024. Its specific size, structure, and annotation details require verification after download.
A filtered version of the Dolly dataset, designed for instruction tuning of large language models. The dataset was created by qingy2024 and was last updated in November 2024. It contains text data categorized for fine-tuning tasks.