Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,551 datasets
61,000+ multimodal samples across text, video, and audio modalities from nine datasets. The MMLA benchmark, created by THUIAR, includes data from films, TV series, YouTube, Vimeo, Bilibili, TED, and improvised scripts for evaluating foundation models.
A processed dataset for training a foundation model for joint segmentation, detection, and recognition of biomedical objects. Each instance includes a 1024x1024 PNG image, a list of textual descriptions for the segmentation target, and a corresponding 1024x1024 binary ground truth mask. The dataset is hosted by Microsoft and was last updated in April 2025.
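The instance layout described above (one image, a list of text descriptions, one binary mask) can be sketched as a small record with a consistency check. This is a minimal illustration only: the class name, field names, and the 0/1 mask encoding are assumptions made for the example and are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

SIDE = 1024  # image and mask side length stated in the dataset description


@dataclass
class SegmentationInstance:
    """Hypothetical container; field names are illustrative, not the dataset's schema."""
    image_path: str          # path to the 1024x1024 PNG image
    descriptions: List[str]  # textual descriptions of the segmentation target
    mask: List[List[int]]    # ground-truth mask, row-major, assumed to hold 0/1 values


def validate_instance(inst: SegmentationInstance) -> List[str]:
    """Return a list of problems found; an empty list means the instance looks consistent."""
    problems = []
    if not inst.image_path.lower().endswith(".png"):
        problems.append("image is not a PNG file")
    if not inst.descriptions:
        problems.append("no textual description provided")
    if len(inst.mask) != SIDE or any(len(row) != SIDE for row in inst.mask):
        problems.append("mask is not 1024x1024")
    elif any(v not in (0, 1) for row in inst.mask for v in row):
        problems.append("mask is not binary")
    return problems


def mask_coverage(inst: SegmentationInstance) -> float:
    """Fraction of pixels marked as foreground in the mask."""
    return sum(sum(row) for row in inst.mask) / (SIDE * SIDE)
```

A loader for the real files would need to map the stored mask values (which may be 0/255 rather than 0/1) onto this encoding before calling `validate_instance`.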
100,000 image conversation samples derived from 45,000 web documents in the OBELICS dataset. GPT-4V and OpenChat 3.5 were used to generate contextual captions and convert them into diverse free-form conversations. The dataset was authored by tiiuae and last updated on February 17, 2025.
12.4 million image-caption pairs constitute the largest public domain image-text dataset for training foundation models. The dataset was created by Spawning, and the identifier of the associated arXiv paper dates it to October 2024. It features community-driven governance mechanisms aimed at reducing harm and supporting reproducibility.
5,047 real-world indoor scenes captured using Apple's ARKit framework, preprocessed for SpatialLM training. The dataset is formatted for oriented object bounding box detection with large language models. It was created by Gen3DF and last updated on June 30, 2025.
Three categories of multimodal geo-spatial data—tabular grids, heatmaps, and geographic visualizations—designed for foundation model evaluation. The benchmark tests the ability to process dense numerical values and interpret spatial-temporal dependencies within these grid structures.
1.432 million image-QA instances for fine-tuning Vision-Language Models on spatial affordance prediction, developed by wentao-yuan in 2024. The collection includes 667K synthetic instances for object and free space referencing, 100K LVIS detection samples, and 150K instruction-following pairs.
DriveQA is a multimodal benchmark for evaluating driving knowledge through text and vision-based question-answering tasks. The dataset, created by DriveQA and last updated on September 1, 2025, simulates real-world driving tests. It likely contains questions on traffic regulations, sign recognition, and right-of-way reasoning.
Synthesized spatial reasoning traces for training LLaVA-style Vision-Language Models, created by remyxai and last updated on April 23, 2025. The data was generated using VQASynth from a subset of images in the localized narratives split of The Cauldron.
3DSRBench is a manually annotated benchmark for evaluating 3D spatial reasoning in large multimodal models. It contains 2,100 visual question-answering pairs on MS-COCO images and 672 on multi-view synthetic images rendered from HSSD. The dataset was created by ccvl and was last updated on the Hugging Face platform in February 2025.
PDF-WuKong is a dataset for training and evaluating large multimodal models on long PDF documents. The data accompanies the research paper 'PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling'. Author yh0075 uploaded it to Hugging Face on 2025-01-06.
OpenGVLab's OmniCorpus CC 210M dataset contains 210 million image-text interleaved documents filtered from the Common Crawl web corpus. The dataset is designed for large-scale vision-language model training, as described in an ICLR 2025 spotlight paper. It was last updated on the Hugging Face platform in March 2025.
6,463 entries representing 6,289 unique headwords in Korean Sign Language (KSL), derived from the Korean Sign Language Dictionary's everyday signs collection. The dataset provides detailed linguistic annotations for each sign, expanding upon the original dictionary's 3,669 signs to offer a more granular lexical database.
WMT24++ Images provides source URLs and full-page document screenshots for the translation data used in the WMT24++ project. The dataset, created by Google and last updated on 2025-02-24, preserves original document structure with embedded images. It is intended to support multimodal translation and language understanding research.
A dataset for instruction tuning with GPT-4, created by the team referenced in the citation. The dataset page was last updated on 2023-05-03. It is intended for research use only and is licensed under CC BY-NC 4.0.
SEAGULL-100w is a large-scale synthetic dataset for no-reference image quality assessment focused on regions of interest. It was created by Zevin2023 and includes images with six distortion types: blur, sharpness, exposure, contrast, colorfulness, and compression. The dataset was last updated on the Hugging Face platform in November 2024.
A question-answer dataset formatted for fine-tuning large language models. The data is sourced from PDF and markdown files extracted from various project repositories within the Cloud Native Computing Foundation landscape. It was created by Kubermatic and last updated on June 27, 2024.
RadFM_data_csv is a collection of files used for training and testing the RadFM foundation model. The dataset includes a radiology test set with captions and article links, a visual question-answering subset for radiology images, and linked article contents. It was authored by chaoyi-wu and last updated on 2024-06-02.
Art Museums PD 440K is a dataset for training text-to-image and multimodal models, containing images and captions sourced from public domain or CC0-licensed materials. The dataset includes English captions translated to Japanese using the ElanMT model, which was trained on a licensed corpus. It was created by Mitsua and last updated on February 13, 2025.
VizWiz-VQA is a large-scale dataset for evaluating large multi-modality models. It is a formatted version used in the lmms-eval pipeline for one-click model evaluations. The dataset was created by lmms-lab and was last updated on March 8, 2024.