Image-text pairs, instruction tuning, visual QA, cross-modal data, foundation model training data
1,543 datasets
Food-Health-VLM-Checkpoint is a dataset published on Kaggle. The title suggests it contains a checkpoint for a Vision-Language Model (VLM) trained on food and health-related data. Specific details regarding data volume, creation method, and authorship are not provided in the available metadata.
A multimodal dataset related to Tamil culture and backdoor attacks in machine learning. The dataset is hosted on Kaggle, but its specific contents, size, and creation details are not provided in the available metadata. Further details regarding the data's origin, collection method, and temporal scope are unknown.
A dataset titled 'WAVLM_Region(VF)' is hosted on Kaggle. The title suggests it contains audio feature representations, likely derived from the WavLM model. No further metadata on size, format, or creation details is available.
Food-health-vlm-checkpoint-best-1 is a model checkpoint likely containing trained weights for a Vision-Language Model (VLM). The dataset is hosted on Kaggle, but its specific contents, creation date, and author are not detailed in the provided metadata. Its title suggests a focus on applying multimodal AI to topics at the intersection of food and health.
Kaggle hosts the WAVLM_Gender(VF) dataset. The title suggests it contains speech audio data likely intended for gender classification tasks. Specific details on volume, creator, and creation date are unavailable.
A dataset focused on emotion recognition, likely containing multiple data modalities such as text, audio, or images. It is hosted on Kaggle, a platform for data science and machine learning projects. The specific collection method, author, and temporal coverage are not detailed in the available metadata.
EgoGazeVQA is an egocentric gaze-guided video question answering benchmark introduced in the paper 'In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting'. The dataset, created by author taiyi09, leverages gaze information to improve the understanding of daily-life videos. It was last updated on 2026-01-22.
OctoCodingBench is a benchmark for evaluating scaffold-aware instruction following in repository-grounded agentic coding. It was created by MiniMaxAI and last updated on Hugging Face on January 13, 2026. The benchmark focuses on whether coding agents follow rules while solving tasks, a dimension not covered by existing task-completion benchmarks.
A dataset from Kaggle containing baseline results for the VIGOR-LLaVA-CAS benchmark. The dataset likely contains performance metrics and outputs for evaluating vision-language models. Its specific size, update date, and author are unknown.
NVIDIA's Nemotron-RL-instruction_following-structured_outputs dataset tests a model's ability to follow output formatting instructions under JSON schema constraints. Each problem consists of a document, an output formatting instruction (schema), and a question, with difficulty varied by instruction location, comprehensiveness, and schema complexity. The dataset was last updated on January 12, 2026.
TRACE is a multimodal dataset published on Kaggle. The dataset's specific content, size, and creation details are not provided in the available metadata. Further verification is required to determine its exact composition, scale, and origin.
Afri-MCQA is a multimodal cultural question-answering benchmark. It contains 8,000 Q&A pairs across 16 African languages from 13 countries, created by native speakers. The dataset was published by Atnafu and last updated in January 2026.
Kaggle hosts the VQA Zewail dataset, likely focused on visual question answering tasks. The dataset's specific content, size, and origin are not detailed in the provided metadata. Its creation date and last update are unknown.
A subset of the dataset introduced in the paper 'ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding'. This dataset is designed to train multimodal models for streaming video understanding, focusing on proactive interaction tasks. It was authored by EurekaTian and last updated on the Hugging Face platform in January 2026.
OmniSpatial is a benchmark dataset for evaluating spatial reasoning in vision-language models, as presented in an ICLR 2026 paper. The data is structured in a JSON schema with components like 'id' for question identification. The dataset was created by author 'qizekun' and last updated on January 27, 2026.
A dataset for instruction tuning, likely containing text prompts and responses in the Maithili language. It was published on the Hugging Face platform by the author Bansal123 and was last updated on March 1, 2026. The specific content, size, and collection methodology are not detailed in the available metadata.
MMAU provides between 1,000 and 10,000 test records for evaluating audio large language models, released by TwinkStart in early 2026. It is integrated into the UltraEval-Audio framework to benchmark performance across 12 task types and 10 languages. The data spans four specialized domains: speech, general sound, medical audio, and music.
WAVLM Base Local is a self-supervised speech representation model. It is hosted on the Kaggle platform, but the dataset's specific contents, size, and creation details are not provided in the available metadata. The model's architecture and training methodology are likely detailed in its associated research publication.
Tagavlm Dataset is a multimodal dataset hosted on Hugging Face, created by user tiredtony. It is intended for vision-language model training and was last updated in March 2026. Its specific contents and size are not detailed.