Training data for the LLaVA-OneVision-2 family of multimodal models, covering large-scale video and spatial reasoning corpora used in mid-training. The dataset includes subsets like 'mid_training_video/60s_rest/' with 10,809 shards of approximately 60-second video clips and JSONL files containing captions for 30-second and 60-second clips. It was created by mvp-lab and last updated on May 6, 2026.
Use Cases
- Mid-training of multimodal LLMs on large-scale video corpora.
- Video captioning model development using the provided 30-second and 60-second caption files.
- Spatial reasoning task training for vision-language models.
- Benchmarking or evaluating video-language model performance.
Strengths
- Contains a structured subset of 10,809 shards of video clips, each approximately 60 seconds long.
- Includes dedicated caption files (JSONL format) for 30-second and 60-second video segments.
- Specifically designed for mid-training of a named multimodal model family (LLaVA-OneVision-2).
Limitations
- Description metadata is limited; actual data quality, content specifics, and video sources require manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count, total size, and license information are unknown, which makes it hard to assess suitability before download.
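Because column-level documentation is absent, one way to infer field semantics after download is to tally the top-level keys in a caption JSONL file. The following is a minimal sketch, not part of the dataset's tooling: the function name `summarize_jsonl` and the `max_records` parameter are hypothetical, and it assumes only that each non-blank line holds one JSON value.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path, max_records=1000):
    """Tally top-level keys over the first `max_records` records of a JSONL file.

    Hypothetical helper: the caption files' real schema is undocumented,
    so this only reports which keys appear and how often.
    """
    key_counts = Counter()
    n_records = 0
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines between records
            record = json.loads(line)
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
            if n_records >= max_records:
                break
    return n_records, dict(key_counts)
```

Calling `summarize_jsonl` on a downloaded caption file returns the number of records read and a mapping from each key to its frequency. Keys that appear in every record are likely required fields, while rarer keys are likely optional.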
Provenance
- Source: mvp-lab
- Collection Method: Likely compiled for mid-training of the LLaVA-OneVision-2 model family; the specific gathering method is not detailed.
- Time Range: Not specified
- Freshness: Last updated 2026-05-06 11:27:31; freshness should be verified.
- Geography: Not specified