HLVid is a benchmark for evaluating Multi-modal Large Language Models (MLLMs) on long-form, high-resolution video understanding. It was introduced by bfshi in the paper "Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing". The dataset features 5-minute videos at 4K resolution, which challenges models to handle significant spatiotemporal redundancy.
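To give a rough sense of the scale behind that redundancy, the sketch below counts the visual patches a ViT-style encoder would see for one clip. Everything numeric here is an assumption for illustration rather than a documented property of HLVid: "4K" is taken to mean 3840x2160, frames are sampled at 1 per second, and patches are 16x16 pixels. The function name `visual_token_count` is hypothetical.

```python
import math


def visual_token_count(width, height, duration_s, fps, patch):
    """Count square patches across all sampled frames of a video."""
    # Patches per frame, rounding up so partial edge patches are counted.
    patches_per_frame = math.ceil(width / patch) * math.ceil(height / patch)
    # Number of frames kept after sampling at `fps` frames per second.
    frames = int(duration_s * fps)
    return patches_per_frame * frames


# Assumed values: 3840x2160 frames, 300 s clip, 1 fps sampling, 16 px patches.
tokens = visual_token_count(3840, 2160, 300, 1.0, 16)
print(tokens)  # 9720000
```

Under these assumptions a single clip yields about 9.7 million patches before any pruning, which suggests why methods that skip redundant regions are of interest for this benchmark.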
Use Cases
- Benchmarking video understanding models on 5-minute long-form videos
- Evaluating model efficiency on high-resolution 4K video data
- Testing multimodal reasoning on video-text question answering tasks
- Researching methods to handle spatiotemporal redundancy in video data
Strengths
- Features 5-minute long-form videos, providing a challenging temporal scale
- Uses high-resolution 4K video, offering detailed visual data
- Specifically designed to benchmark Multi-modal Large Language Models (MLLMs)
Limitations
- Column-level documentation is absent; field semantics must be inferred after download
- Row count is not documented, which makes it harder to assess suitability before download
- Description metadata is limited; actual data quality requires manual inspection after download
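Because field semantics and row count have to be worked out after download, a small inspection pass can help. The following is a minimal sketch, assuming the downloaded rows can be iterated as Python dictionaries. The helper `summarize_records` and the sample field names (`video`, `question`, `answer`) are hypothetical and are not taken from the actual HLVid schema.

```python
from collections import defaultdict


def summarize_records(records):
    """Return (row_count, {field: sorted type names}) for an iterable of dicts."""
    field_types = defaultdict(set)
    row_count = 0
    for record in records:
        row_count += 1
        for key, value in record.items():
            # Record every Python type observed for this field.
            field_types[key].add(type(value).__name__)
    return row_count, {key: sorted(names) for key, names in field_types.items()}


# Hypothetical rows for illustration only; real HLVid fields may differ.
sample = [
    {"video": "clip_0001.mp4", "question": "What is written on the sign?", "answer": "B"},
    {"video": "clip_0002.mp4", "question": "How many people enter?", "answer": None},
]

rows, schema = summarize_records(sample)
print(rows)    # 2
print(schema)  # {'video': ['str'], 'question': ['str'], 'answer': ['NoneType', 'str']}
```

A field that reports more than one type, such as `answer` in the sample, is a hint that values may be missing or inconsistently encoded and deserve manual inspection.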
Provenance
- Source: bfshi via Hugging Face
- Collection Method: Introduced as a benchmark in the associated research paper.
- Freshness: Last updated 2026-03-19 20:39:55; freshness should be verified