Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
42,678 Vietnamese images paired with detailed text descriptions and visual question-answering pairs generated by GPT-4o. The dataset includes spatial metadata for objects and text, covering specific attributes such as font style, color, and size within a Vietnamese linguistic context.
ChartX & ChartVLM pairs a benchmark (ChartX) with a foundation model (ChartVLM) for evaluating how well multimodal large language models query and reason over information in visual charts. The dataset was created by InternScience and was last updated on Hugging Face in December 2024. It is intended to test chart understanding and reasoning capabilities comprehensively and rigorously.
A Korean language dataset constructed for supervised fine-tuning (SFT) of large language models as part of a Sungkyunkwan University industry-academic cooperation project. The dataset was created by preprocessing and filtering data from sources including Stanford Alpaca and OIG-Chip2 using ChatGPT-3.5 Turbo 16k to improve naturalness. The dataset page was last updated on 2023-09-25.
103 participants, including 30 homosexual men, 35 heterosexual men, and 38 heterosexual women, underwent multimodal MRI scans. Amirhossein Manzouri published this study in 2020, comparing cortical thickness, surface area, subcortical volumes, and resting-state functional connectivity across groups.
Biomechanical data from a study investigating how flamingos support their body on one leg with minimal muscle force. The research includes measurements from both cadaveric specimens and live flamingos, analyzing body sway and joint posture. The dataset is associated with a study published in 2020 by author Young-Hui Chang.
Multifaceted Collection is a dataset for aligning large language models to diverse human preferences, using system messages to represent individual preferences. The dataset was created by KAIST AI and released in June 2024. Instructions are sourced from five existing datasets.
MTBench is a multimodal time series benchmark for evaluating large language models in temporal and cross-modal reasoning. The dataset aligns high-resolution financial time series, such as stock prices, with textual context like news articles or QA prompts. It was created by GGLabYale and last updated on 2025-05-23.
Wind data collected at sites along Old Ingraham Highway near Flamingo, FL and C-111. The dataset includes date, time, wind speed, and direction, aimed at improving the treatment of wind forcing in hydrological models. It was collected by CEOS_EXTRA and last updated on 1997-12-31.
A collection of controversial and adult-themed images for training multimodal detection models, curated by QuixiAI. The dataset is designed to enable detailed categorization and filtering of such content.
Pokémon BLIP captions is a multimodal dataset used to train a Pokémon text-to-image model. The dataset was created by author reach-vb and last updated on March 12, 2024. It contains Pokémon images from the FastGAN project paired with captions generated by the pre-trained BLIP model.
TRL's Sentiment and Descriptiveness Preference Dataset originates from an early RLHF paper by OpenAI. The data has been preprocessed into a standard prompt, chosen, rejected format for reinforcement learning from human feedback. The dataset was last updated on the Hugging Face platform on 2024-04-09.
10,000,000 image-caption pairs for the Megalith-10M image collection, with captions generated using the Florence-2 vision-language model. The textual descriptions supplement the previously uncaptioned CC0-like images to support vision-language model training.
VisCon-100K is a dataset of 100,000 image-conversation samples designed for fine-tuning vision-language models. It is derived from 45,000 web documents in the OBELICS dataset, with captions generated by GPT-4V and converted into free-form conversations by OpenChat 3.5. The dataset was created by tiiuae and last updated on February 17, 2025.
Four meta-task categories including Screenshot Retrieval (SR), Composed Screenshot Retrieval (CSR), Screenshot QA (SQA), and Open-Vocabulary form the core of this Visualized Information Retrieval (Vis-IR) benchmark. The dataset utilizes digital screenshots to unify search and information extraction tasks across diverse application scenarios.
MMEB-V2 is a benchmark dataset for evaluating multimodal embedding models, created by VLM2Vec and updated in September 2025. It expands upon a previous version to include five new tasks: Video Retrieval, Moment Retrieval, Video Classification, Video Question Answering, and Visual Document Retrieval.
CVQA is a culturally diverse multilingual visual question answering benchmark consisting of over 10,000 questions from 39 country-language pairs. The dataset was constructed through a collaborative effort led by researchers from MBZUAI and is designed for use as a test set. It was last updated on November 27, 2024.
PathMMU is a massive multimodal expert-level benchmark for understanding and reasoning in pathology. It was released by author jamessyx on Hugging Face, with the benchmark data and evaluation code published on August 7, 2024. The dataset is intended to address the lack of specialized, high-quality benchmarks for large multimodal models in the pathology domain.
AlignMMBench is a multimodal alignment benchmark released in June 2024 by zai-org. It evaluates Chinese large vision-language models across single-turn and multi-turn dialogue scenarios. The dataset encompasses three categories and thirteen sub-tasks, as detailed in its associated arXiv paper.
An extension of the CommonCatalog CC-BY dataset with Japanese-language image captions. The author alfredplpl added one simple and three detailed captions per image, generated by a modified LLaVA-JP model. The dataset was last updated on June 23, 2024.
This multimodal fashion dataset provides image-text pairs annotated across categories, style, colors, materials, keywords, and fine-details. It is specifically curated to evaluate vision-language models like Marqo-FashionCLIP and Marqo-FashionSigLIP using fine-grained attribute metadata.