Per-segment captions for multi-view video datasets of humans and animals. Captions were generated from masked multi-view composites using the Gemini 3 Flash model and follow the ActivityNet-style dense video-captioning layout. The dataset was authored by 'andaba' and last updated on June 12, 2026.
Use Cases
- Train dense video-captioning models on ActivityNet-style data with paired captions and timestamps.
- Fine-tune vision-language models for multi-view video understanding using the generated captions.
- Benchmark multi-view video analysis algorithms on the included human (DNA-Rendering, ActorsHQ) and animal (Artemis / DFA) datasets.
- Study the alignment between visual segments and textual descriptions in multi-view composites.
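The card does not document the exact field names, so the following is only a minimal sketch of how paired captions and timestamps could be read, assuming the files follow the public ActivityNet Captions convention (a mapping from video ID to `duration`, `timestamps`, and `sentences`). The sample record, its ID, and the helper `iter_segments` are hypothetical and exist only for illustration.

```python
import json

# Hypothetical record in the ActivityNet Captions layout.
# The video ID, duration, and sentences are invented for this sketch.
SAMPLE = json.loads("""
{
  "example_clip_0001": {
    "duration": 12.0,
    "timestamps": [[0.0, 4.5], [4.5, 12.0]],
    "sentences": ["A person raises both arms.", "The person turns and walks away."]
  }
}
""")


def iter_segments(annotations):
    """Yield (video_id, start, end, caption) for every captioned segment."""
    for video_id, entry in annotations.items():
        timestamps = entry["timestamps"]
        sentences = entry["sentences"]
        # The layout relies on the two lists being parallel.
        if len(timestamps) != len(sentences):
            raise ValueError(f"{video_id}: timestamps and sentences differ in length")
        for (start, end), sentence in zip(timestamps, sentences):
            yield video_id, float(start), float(end), sentence.strip()


segments = list(iter_segments(SAMPLE))
for video_id, start, end, caption in segments:
    print(f"{video_id} [{start:.1f}s to {end:.1f}s]: {caption}")
```

If the downloaded files use different keys, only the three lookups inside `iter_segments` should need to change.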
Strengths
- Captions follow a structured ActivityNet-style dense video-captioning layout.
- Covers multiple specific datasets: human datasets (DNA-Rendering, ActorsHQ) and animal datasets (Artemis / DFA).
- Captions were generated using a specific, named AI model (Gemini 3 Flash).
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not reported, which makes it hard to judge suitability before download.
- Description metadata is limited; data quality must be checked by manual inspection after download.
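Because column documentation and row count are missing, a first pass after download might simply tally which fields appear and what types they hold. The sketch below assumes the records have already been loaded as a list of dictionaries; the helper `summarize_fields` and the sample records are hypothetical and do not reflect the dataset's actual schema.

```python
from collections import Counter, defaultdict


def summarize_fields(records):
    """Count, for each field name, how often each Python value type occurs."""
    summary = defaultdict(Counter)
    for record in records:
        for key, value in record.items():
            summary[key][type(value).__name__] += 1
    return {key: dict(counts) for key, counts in summary.items()}


# Hypothetical records standing in for rows loaded from the downloaded files.
records = [
    {"video_id": "a", "duration": 3.0, "sentences": ["x"]},
    {"video_id": "b", "duration": None, "sentences": ["y", "z"]},
]

row_count = len(records)
report = summarize_fields(records)

print(f"rows: {row_count}")
for field, type_counts in report.items():
    print(f"{field}: {type_counts}")
```

Mixed types or unexpected `NoneType` counts in such a report are a quick signal of fields that need closer manual inspection.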
Provenance
- Source: Hugging Face
- Collection Method: Captions generated from masked multi-view composites with Gemini 3 Flash.
- Time Range: not specified
- Freshness: Last updated 2026-06-12 02:04:27; freshness should be verified.
- Geography: not specified