Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,560 datasets
AI2D contains between 1,000 and 10,000 scientific diagrams with corresponding text annotations, published by Aniruddha Kembhavi and the Allen Institute for AI in 2016. The dataset is designed to support research in diagrammatic reasoning and visual question answering within the scientific domain.
The TextVQA dataset contains 45,336 questions based on 28,408 images from the OpenImages collection. It requires models to read and reason about text present within images to answer the provided questions.
20,000 samples combine questions and images from three established VQA datasets: AOKVQA, Path-VQA, and TDIUC. This medium-sized benchmark is designed to test the multi-domain knowledge of vision-language models. It was created by dutta18 for educational and research purposes, with copyright retained by the original dataset owners.
141 million interleaved image-text web documents containing 115 billion text tokens and 353 million images comprise the OBELICS collection. Created by Hugging Face and updated in 2024, it serves as a massive open-source resource for multimodal AI development.
Human Preference Dataset v2 (HPD v2) is a large-scale collection of human preference choices on images generated by text-to-image models. It contains 798,000 preference choices across 430,000 images. The dataset was created by ymhao and was last updated on February 21.
A dataset of Pokémon image-text pairs was removed from the Hugging Face platform on March 20, 2024. The takedown was initiated by The Pokémon Company International, Inc. via a DMCA notice, and the dataset author is listed as 'lambda'.
This DPO dataset contains pairs of harmful prompts and model responses derived from the LLM-LAT/harmful-dataset. It reconfigures the preference structure by labeling standard model refusals as 'rejected' and the original harmful or incorrect answers as 'chosen'.
ImageRewardDB is a text-to-image human preference dataset containing 137,000 expert comparison pairs. It was created by zai-org and uploaded to Hugging Face on June 21, 2023. The dataset is built from text prompts and corresponding model outputs sourced from DiffusionDB.
A dataset of short, natural Vietnamese dialogues for fine-tuning language models like Mamba, LLaMA, and Gemma. It contains everyday communication, frequently asked questions, and emotional responses, formatted as JSONL and ready for instruction tuning. The dataset was created by hoanghai2110 for the Vietnamese open-source AI community.
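Because the records are stored as JSONL (one JSON object per line), they can be read with the standard library alone. The sketch below is only an illustration: the field names `prompt` and `response` and the two sample lines are hypothetical, since the listing does not describe the dataset's schema.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Hypothetical sample data; the real dataset's field names may differ.
sample = io.StringIO(
    '{"prompt": "Xin chào!", "response": "Chào bạn!"}\n'
    '\n'
    '{"prompt": "Bạn khỏe không?", "response": "Mình khỏe, cảm ơn bạn."}\n'
)

records = list(read_jsonl(sample))
```

For a file on disk, the same function works with `open(path, encoding="utf-8")` in place of the in-memory stream.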
This dataset offers image features extracted from the Flickr8k dataset using a ResNeXt-152 C4 architecture. It includes Arabic and English captions and splits provided by ElJundi et al., and is intended for use with the OSCAR learning method.
This is a reformatted version of theblackcat102/llava-instruct-mix, prepared for Vision Supervised Fine-Tuning (VSFT) with the TRL SFT Trainer. It is designed for instruction tuning of multimodal vision-language models. The dataset's author is HuggingFaceH4, and it was last updated in April 2024.
The Tumblr GIF (TGIF) dataset contains 100,000 animated GIFs and 120,000 descriptive sentences. GIFs were collected from randomly selected Tumblr posts published between May and June 2015, with sentences gathered via a crowdsourced annotation interface. It is designed for evaluating animated GIF and video description techniques.
Title and encoded image pairs from Medium articles, derived from a Kaggle dataset of 128,000 articles. The images were centrally cropped to a square and resized to 256x256 pixels before being encoded into image tokens.
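The central square crop described above comes down to a little box arithmetic. The following is a minimal sketch, not the dataset author's actual preprocessing code: the helper name `center_crop_box` is hypothetical, and the later step of encoding images into tokens is not covered because the listing does not say which tokenizer was used.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


TARGET_SIZE = (256, 256)  # output resolution stated in the description

# Example: a 640x480 landscape image keeps its middle 480x480 region.
box = center_crop_box(640, 480)

# With Pillow installed, the two steps would be roughly:
#   img.crop(box).resize(TARGET_SIZE)
```

The box uses the (left, top, right, bottom) ordering that Pillow's `Image.crop` expects, so it can be passed through unchanged.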
LLM-jp, a collaborative Japanese project, created this synthetic dataset for instruction tuning. It contains a subset of the 801,000-instruction Aratako/Synthetic-JP-EN-Coding-Dataset. The dataset was last updated in January 2025.
Datasetloom is an open-source platform for constructing and evaluating datasets for multimodal large language models (MLLMs), developed by 599yongyang and updated in December 2025. It provides a full-stack framework using TypeScript, Next.js, and NestJS to streamline the creation of training data for vision-language tasks.
A PyTorch-based implementation of the OpenAI CLIP architecture for image-text alignment, authored by Moein Shariatnia and updated in October 2025. It provides a dual-encoder framework for processing image-text pairs using BERT for natural language processing and Vision Transformer components.
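For readers unfamiliar with how a dual encoder aligns image-text pairs, the sketch below shows a symmetric contrastive objective in the style of the original CLIP paper, written in plain Python rather than PyTorch. It is an assumption-laden illustration, not this repository's code: the function names are hypothetical, the temperature of 0.07 is an assumed default, and the repository may use a different loss formulation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same pair.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Correctly paired embeddings should score a lower loss than shuffled ones.
matched = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a real training loop the embeddings would come from the text and image encoders, and the loss would be computed on tensors so that gradients flow back into both.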
30 patients with basal cell carcinomas contributed to this multimodal dataset of paired reflectance confocal microscopy images and Raman spectra. The data was collected via point-by-point scanning. The dataset is authored by Khan, Fadeel Sher and hosted by the Texas Data Repository. It was last updated on March 18, 2024.
This dataset features 30,000 sarcastic tweets paired with GIF reactions. It was created for research on predicting induced affect, as detailed in an ACL 2021 paper by Shmueli, Ray, and Ku.
The MDocAgent dataset supports a framework for multi-modal document understanding, as described in the associated arXiv paper. The dataset was created by Lillianwei and last updated on August 22, 2025. It is hosted on Hugging Face and is associated with a GitHub repository containing the framework's code.
Ruozhiba, a popular forum on Baidu Tieba known for short, witty content, provides this raw collection of posts. The dataset was created by user 'kirp' and last updated in October 2024. It contains an unspecified number of posts scraped from the forum up to November 10, 2023.