Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
AGIEval is a human-centric benchmark for evaluating foundation models. This dataset contains the JEC-QA-CA subtask, the case-analysis portion of JEC-QA, a Chinese legal question-answering set. The dataset was processed from the AGIEval repository by the user 'hails' and was last updated on the Hugging Face platform on 2024-01-26.
Art-Free-SAM contains filtered image segment IDs from the original SA-1B dataset. The dataset pairs these segments with captions sourced from SAM-LLaVA-Captions10M, organized in a hierarchical folder structure. The dataset was authored by rhfeiyang and last updated on Hugging Face in December 2024.
ViP-Bench is a region-level evaluation benchmark for multimodal models, curated by the University of Wisconsin-Madison. It provides two kinds of visual prompts for testing model understanding: bounding boxes and diverse human-drawn visual prompts. The dataset was last updated on December 15, 2023.
An enriched version of the SROIE 2019 dataset adds labels for line descriptions and line totals to aid OCR and layout understanding. The training split contains 652 samples, each pairing an image with OCR text data. Arvindrajan92 published this multimodal dataset on Hugging Face in October 2022.
A custom collection of paintings, images, and photographs exhibiting various types of damage. The dataset was created via manual collection and semi-automated annotation, with an initial sweep using the BLIP model followed by manual refinement. It was last updated on May 5, 2024, by the author 'calm-and-collected'.
A formatted version of the TextVQA benchmark dataset, used for evaluating large multi-modality models. It was created by lmms-lab and last updated on March 8, 2024. The dataset is part of the lmms-eval pipeline for one-click model evaluations.
CogVLM-SFT-311K is the primary aligned corpus used in the initial training of CogVLM v1.0. The dataset contains approximately 311,000 bilingual visual instruction samples, constructed by selecting 3500 high-quality samples from MiniGPT-4, integrating them with LLaVA-Instruct-150K, and translating them into Chinese via a language model. The dataset was created by zai-org and last updated on December 26, 2023.
This benchmark contains millions of nature photographs paired with expert-level scientific queries for text-to-image retrieval tasks. It evaluates multimodal models on their ability to process complex biological and ecological inquiries against large-scale image collections to support scientific discovery.
A large-scale multimodal instruction tuning dataset for colonoscopy research, comprising over 300,000 colonoscopic images and 128,000 medical captions generated by GPT-4V. The dataset includes 62 categories and is designed to instruct models to execute user-driven tasks interactively. It was created by ai4colonoscopy and last updated on February 4, 2025.
A 2024 mixture of text preference datasets used to train the weqweasdas/RM-Mistral-7B reward model for Reinforcement Learning from Human Feedback. The dataset was created by OpenRLHF and includes multiple sources of human-annotated comparisons. It is designed for training models to score and rank text outputs based on human preferences.
PathVQA is a dataset for Medical Visual Question Answering built from the 'Textbook of Pathology' and 'Basic Pathology' textbooks. It contains question-answer pairs on pathology images, including both open-ended and binary yes/no questions.
LLM-jp, a collaborative project in Japan, provides this dataset. It is a Japanese translation of a roughly 21,000-instruction English subset from the OASST1 dataset, created using the DeepL translation service. The dataset was last updated in February.
Supplementary materials for a study on the effects of scale on multimodal deixis. The dataset includes gesture form coding from the study's first author and a reliability coder, along with annotation guidelines and an R script for statistical replication. The data was archived in the Texas Data Repository and last updated in March 2024.
LLM-jp provides a Japanese instruction-tuning dataset containing 33,000 entries. The dataset is a Japanese translation of a subset from the English OASST2 dataset, processed using DeepL. It was created by the LLM-jp collaborative project and last updated on April 28, 2024.
A dataset used to train the CoEdIT text editing models, as described in the paper 'CoEdIT: Text Editing by Task-Specific Instruction Tuning'. It was created by authors Vipul Raheja, Dhruv Kumar, Ryan Koo, and Dongyeop Kang and is hosted on Hugging Face by Grammarly. The dataset was last updated on October 21, 2023.
Clinician-generated question-answer pairs about radiology images, covering both open-ended and binary yes/no questions. The dataset uses medical imagery sourced from the MedPix open-access database to support the development of Medical Visual Question Answering (VQA) systems.
Polaris contains between 100,000 and 1,000,000 records of human feedback on image-caption pairs, released by researcher yuwd in 2024. This multimodal dataset supports the development of evaluation metrics that align with human judgment as described in the CVPR 2024 paper "Polos."
Screen2Words provides image captions for mobile application screens. It is built upon the RICO mobile app image database. The dataset was uploaded by rootsautomation to Hugging Face in April 2024.
High-quality images paired with descriptive text annotations, designed for computer vision and multimodal machine learning tasks. The dataset was created by Navanjana and last updated on May 21, 2025. Images are preprocessed to a standard dimension of 224×224 pixels in JPEG RGB format.
LSVQ is described by its maintainers as the largest dataset available for no-reference video quality assessment (NR-VQA). This unofficial copy facilitates research after reports that the original links are unavailable. The dataset was created by Ying et al. and published at CVPR in 2021.