Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,560 datasets
Title and encoded image pairs from Medium articles, derived from a Kaggle dataset of 128,000 articles. The images were centrally cropped to a square and resized to 256x256 pixels before being encoded into image tokens.
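The crop-and-resize step described above can be sketched as follows. This is a minimal illustration, not the dataset authors' code: the function names are hypothetical, and rounding the crop offsets down for odd margins is an assumption, since the listing does not say how they were handled.

```python
from typing import Tuple

TARGET_SIZE = 256  # side length in pixels, as stated in the dataset description


def center_crop_box(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    side = min(width, height)
    # Assumption: odd margins are rounded down, so any extra pixel
    # is trimmed from the right or bottom edge.
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def resize_scale(width: int, height: int, target: int = TARGET_SIZE) -> float:
    """Scale factor that maps the cropped square onto target x target pixels."""
    return target / min(width, height)


if __name__ == "__main__":
    print(center_crop_box(640, 480))
    print(resize_scale(640, 480))
```

The returned box follows the (left, top, right, bottom) convention that most imaging libraries use for cropping, so it can be handed to a crop call before resizing the result to 256x256.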
LLM-jp, a collaborative Japanese project, created this synthetic dataset for instruction tuning. It contains a subset of the 801,000-instruction Aratako/Synthetic-JP-EN-Coding-Dataset. The dataset was last updated in January 2025.
Datasetloom is an open-source platform for constructing and evaluating datasets for multimodal large language models, also called vision-language models (VLMs), developed by 599yongyang and updated in December 2025. It provides a full-stack framework using TypeScript, Next.js, and NestJS to streamline the creation of training data for vision-language tasks.
A PyTorch-based implementation of the OpenAI CLIP architecture for image-text alignment, authored by Moein Shariatnia and updated in October 2025. It provides a dual-encoder framework for processing image-text pairs using BERT for natural language processing and Vision Transformer components.
Thirty patients with basal cell carcinomas contributed to this multimodal dataset of paired reflectance confocal microscopy images and Raman spectra. The data was collected via point-by-point scanning. The dataset is authored by Khan, Fadeel Sher and hosted by the Texas Data Repository. It was last updated on March 18, 2024.
A dataset of 30,000 sarcastic tweets paired with GIF reactions. It was created for research on predicting induced affect, as detailed in an ACL 2021 paper by Shmueli, Ray, and Ku.
The MDocAgent dataset supports a framework for multi-modal document understanding, as described in the associated arXiv paper. The dataset was created by Lillianwei and last updated on August 22, 2025. It is hosted on Hugging Face and is associated with a GitHub repository containing the framework's code.
Ruozhiba, a popular forum on Baidu Tieba known for short, witty content, provides this raw collection of posts. The dataset was created by user 'kirp' and last updated in October 2024. It contains an unspecified number of posts scraped from the forum up to November 10, 2023.
A dataset of image captions depicting unsafe or illegal activities, hosted on HuggingFace. The dataset was created by Lenkashell and was last updated on July 16, 2025. The specific content, scale, and structure of the data are not detailed in the available metadata.
Hugging Face released Chug in April 2024 to provide sharded dataset loaders and decoders for multi-modal document, image, and text data. It focuses on efficient distributed training using WebDataset and PDF formats for computer vision and document understanding tasks.
73,893 short videos from the TRECVID VTT task, each ranging from 3 to 10 seconds in duration. The dataset includes between 2 and 5 human-written captions per video, created by dedicated annotators hired by NIST.
Three categories of preference data (toxid-dpo-natural-v4, rawrr v2-1 stage 2, and no_robots) make up this merged dataset. The samples focus on human-like conversational responses to prevent models from overfitting to rigid instruction-following templates.
Stvqa 7K is a dataset referenced in a paper available on arXiv. The dataset is hosted on the HuggingFace platform by the author OX-PIXL and was last updated on November 12, 2025. Its specific content and scale are not detailed in the provided metadata, but platform tags suggest it relates to vision-language tasks.
A dataset of image captions related to illegal activities, created by Lenkashell and last updated on July 16, 2025. The dataset is hosted on HuggingFace, but its specific content, size, and structure are not detailed. Its intended purpose appears to be for training or evaluating content safety models.
A combined dataset for medical visual question answering, merging the VQARAD and SLAKE collections. The dataset was created by Shashwath01 and was last updated on March 5, 2024. It has been used to train a specific model hosted on Hugging Face.
PKU-Alignment developed this dataset to facilitate Constrained Value Alignment through Safe Reinforcement Learning from Human Feedback (Safe RLHF). It provides human-annotated preference data for Large Language Models, specifically targeting the balance between helpfulness and safety constraints as of late 2024.
A Chinese preference dataset developed for alignment with human values, as described in the associated research paper. The dataset was created by m-a-p and was last updated on Hugging Face on April 15, 2025. Its specific scale and content are detailed in the paper 'COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values'.
Psych 101 Test is a text-based evaluation suite containing between 1,000 and 10,000 records for benchmarking human cognition models. Created by Marcel Binz in 2024, it serves as the private test set for the 'Centaur' foundation model research.
A French translation of the Anthropic HH-RLHF dataset, created to support alignment research in the French NLP community. The dataset was uploaded by AIffl and last updated on June 15, 2024. Its specific size, row count, and column structure are not detailed in the provided metadata.
OK-VQA contains 14,055 open-ended visual questions, each with 5 ground truth answers. The dataset is manually filtered to ensure all questions require outside knowledge, such as from Wikipedia, and has been processed to reduce bias from common answers.