Loading...
Loading...
Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,540 datasets
75,491 image and question-answer pairs depicting microscopic organisms, specifically diatoms and fungal spores. The dataset covers 95 genera and is released under a CC-BY 4.0 license. It was created for a hackathon event on the Kaggle platform.
InfoBayAI published this Arabic non-STEM textbook sample in March 2026, providing between 1,000 and 10,000 records for LLM training. It is derived from a larger multilingual corpus of 1.9 billion words across 27,000 textbooks and is structured for instruction tuning and evaluation.
VCIF-10K provides data for training Multimodal Large Language Models on visual instruction following tasks. The dataset is structured in a messages format with user instructions and assistant responses, referencing images from sources like LLaVA-Instruct and Visual Genome. It was created by WoofWoof and supports both Supervised Fine-Tuning and Direct Preference Optimization training paradigms.
HORA is a large-scale multimodal dataset that converts human handβobject interaction demonstrations into robot-usable supervision. It combines HOI-style annotations like MANO hand parameters and object pose with embodied-robot learning signals such as end-effector trajectories under a unified canonical action space. The dataset was created by HORA-DB and last updated on Hugging Face in March 2026.
A synthetic task dataset of 1,070,917 agentic command operations for testing multimodal AI agents. The dataset is engineered for evaluating AI agents operating within complex software infrastructures like creative and engineering tools. It was created by author kryp1234 and last updated on March 15, 2026.
VisionFoundry-10K provides 10,000 synthetic image-question-answer triples across 10 vision-centric tasks, released by TheMartyr in 2026. The data is produced via a pipeline where an LLM generates prompts, a text-to-image model synthesizes visuals, and a multimodal verifier filters for alignment.
Five LLaVA model checkpoints uploaded by author xym93168 on Hugging Face in April 2026. The checkpoints document different training stages, including pre-training and supervised fine-tuning, with varying GPU configurations and batch sizes. Specific checkpoints include 'Pre_32gpu_llava_bs8_0111_1epoch/checkpoint-2181' and 'sft_8gpu_llava_bs08_0106_1epoch/checkpoint-2353'.
TextEditBench is a benchmark for evaluating reasoning-aware text editing across 14 topics and 6 task types. It was created by CSU-JPG and last updated on March 9, 2026. The benchmark emphasizes scenarios requiring understanding of physical plausibility, linguistic meaning, and cross-modal dependencies.
OptimusKG is a modern biomedical multimodal Label Property Graph (LPG). The dataset was authored by Lucas Vittor and is hosted on the Harvard Dataverse platform, with a last recorded update on April 14, -2026.
Zoo-Bus VQA is a synthetic visual question answering dataset built for spatial reasoning and object-centric grounding. It contains generated scenes with benches, stop signs, people, animals, and a clock object representing a bus. The dataset was created by author aprilavrilivan and last updated on March 22, 2026.
Opus-4.6 Reasoning 3000x filtered dataset provides a Turkish translation of reasoning data for LLM training. The dataset is created by Chan-Y to support instruction-following and alignment tasks in Turkish. It was last updated on March 22, 2026.
MR-RATE-vista-seg contains voxel-wise multi-label brain segmentation maps predicted for center modality brain MRI volumes in native space. The dataset is part of the MR-RATE vision-language foundation model release by author Forithmus, with a last recorded update in March 2026. It is hosted on Hugging Face and includes platform tags for healthcare, radiology, and multimodal tasks.
GroundSet is a large-scale Earth Observation dataset built on 20 cm resolution optical aerial orthophotos and legally verified cadastral vector data from the French national mapping agency (IGN). It is designed to advance fine-grained spatial understanding for multimodal models. The dataset was created by RogerFerrod and was last updated in March 2026.
Anthropic Hh Rlhf Preprocessed is a dataset published on huggingface by TheHassanSaud. The title suggests it contains preprocessed data from Anthropic's 'HH' (Helpful and Harmless) project, likely used for Reinforcement Learning from Human Feedback (RLHF). The dataset was last updated on 2026-04-24 18:40:45.
PersonaVLM is a framework for transforming general-purpose multimodal large language models into personalized assistants. The work, authored by ClareNie, was accepted for presentation at CVPR 2026.
Pre-tokenized `.bin` shards for efficient Assamese large language model training. The dataset is hosted on Kaggle, but the author, organization, and specific scale are unknown. The last update date is also unknown.
NVIDIA released this collection of dataset blends in March 2026 to document the specific data mixtures used for Reinforcement Learning (RL) training of the Nemotron-3-Super-120B-A12B model. The data is organized into six distinct training stages including Reinforcement Learning from Verifiable Rewards (RLVR), Software Engineering (SWE), and Reinforcement Learning from Human Feedback (RLHF).
A multimodal dataset from HuggingFace provides synchronized vision and tactile glove sensor data across distinct tasks. The dataset includes RGB video at 30 Hz and 720p resolution, lossless 16-bit depth streams, monochrome camera views, and per-frame aligned tactile data in Parquet format. It was created by touchtronix and last updated on March 16, 2026.
DECO-50 comprises over 5 million frames of teleoperated data for bimanual dexterous manipulation with tactile sensing. The dataset includes 50 hours of data collected on real dual-arm robots across 4 scenarios and 28 subtasks. It was created by BAAI-Humanoid and was last updated on Hugging Face in February 2026.
Training corpus for GO-GPT, an autoregressive transformer model for Gene Ontology term prediction. It contains proteins annotated with GO terms, InterPro domains, STRING protein-protein interactions, and metadata sourced from UniProt.