This dataset comprises a subset of 89,440 videos with 608,000 event instances annotated for temporal grounding. It was created by yingsen and last updated on August 1, 2025. It is derived from the InternVid-FLT video-text alignment data through an automated annotation process detailed in the associated paper.
Use Cases
- Train models for video temporal grounding based on the annotated event instances.
- Benchmark video-language models on temporal localization tasks.
- Develop video large language models (Video LLMs) using the multimodal video-text data.
- Research automated video annotation methods based on the Distime process described in the associated paper.
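For the benchmarking use case, a minimal sketch of a temporal localization metric is given here. It assumes each event instance reduces to a (start, end) span in seconds and that predictions are paired one-to-one with ground-truth spans; neither the span format nor the choice of metric is confirmed by the dataset description. The names `temporal_iou` and `recall_at_iou` are illustrative, not part of any dataset tooling.

```python
from typing import List, Tuple

# Assumed representation: (start_seconds, end_seconds). The dataset's real
# annotation format is undocumented and may differ.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time spans, in the range [0, 1]."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds: List[Span], gts: List[Span], threshold: float = 0.5) -> float:
    """Fraction of paired predictions whose IoU meets the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must be paired one-to-one")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Hypothetical spans, for illustration only.
predictions = [(0.0, 10.0), (2.0, 4.0)]
ground_truth = [(5.0, 15.0), (2.0, 4.0)]
score = recall_at_iou(predictions, ground_truth, threshold=0.5)
```

With these made-up spans, the first pair overlaps by 5 seconds over a 15-second union and falls below the 0.5 threshold, while the second pair matches exactly, so half of the predictions count as hits.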
Strengths
- Contains 608,000 annotated event instances.
- Based on a subset of 89,440 videos.
- Annotation process is detailed in an associated research paper.
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not reported in the metadata, which makes suitability harder to assess before download.
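Because column-level documentation is absent, a first pass at inferring field semantics might simply tally which fields appear and which value types they hold. The sketch below assumes the downloaded annotations can be read as a list of dictionaries; the field names in `sample` are hypothetical placeholders, not the dataset's actual columns.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def summarize_fields(records: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Hypothetical records; the real field names and types are undocumented.
sample = [
    {"video_id": "abc", "caption": "a dog runs", "start": 1.5, "end": 4.0},
    {"video_id": "def", "caption": "a car stops", "start": 0, "end": 2.5},
]
summary = summarize_fields(sample)
```

A field that reports more than one type, such as `start` in this made-up sample, is a cue to inspect those rows by hand before relying on the column.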
Provenance
- Source: Hugging Face, author yingsen
- Collection Method: Automated annotation derived from InternVid-FLT video-text alignment data.
- Time Range: not specified
- Freshness: Last updated 2025-08-01 13:51:35; freshness should be verified.
- Geography: not specified