VAST is an omni-modality dataset and foundation model, presented at NeurIPS 2023, that covers four modalities: vision, audio, subtitles, and text. Each video's visual frames are paired with the corresponding audio track, a subtitle transcript, and descriptive text, giving a common basis for multi-modal learning.
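To make the four-way pairing concrete, here is a minimal sketch of how one clip's record might be represented. The class name `OmniSample`, its field names, and the `modalities` helper are hypothetical and chosen for illustration only; they are not taken from the VAST release, whose actual storage format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OmniSample:
    """Hypothetical container for one clip (illustrative, not the VAST schema)."""
    frames: List[List[float]]  # one feature vector per sampled video frame
    audio: List[float]         # audio samples or features for the same clip
    subtitle: str              # subtitle transcript; may be empty
    caption: str               # descriptive text for the clip

    def modalities(self) -> List[str]:
        """Return the names of the modalities that are non-empty for this clip."""
        present = []
        if self.frames:
            present.append("vision")
        if self.audio:
            present.append("audio")
        if self.subtitle:
            present.append("subtitle")
        if self.caption:
            present.append("text")
        return present


# A clip with no subtitle still carries the other three modalities.
sample = OmniSample(
    frames=[[0.1, 0.2]],
    audio=[0.0, 0.1],
    subtitle="",
    caption="a dog barks in a yard",
)
available = sample.modalities()
```

Checking which modalities are present before training or evaluation is one simple way to handle clips where, for example, no subtitle exists.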
Use Cases
- Train foundation models using the vision, audio, and subtitle modalities to create unified cross-modal embeddings
- Develop video-to-text generation systems that leverage subtitle and audio features for more accurate descriptions
- Evaluate multi-modal retrieval performance by matching text prompts against the combined vision-audio-subtitle feature space
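The retrieval use case above can be sketched as a Recall@K computation over precomputed embeddings. This is a simplified illustration, not VAST's evaluation code: it assumes every modality has already been embedded into vectors of the same length, and it fuses vision, audio, and subtitle embeddings by averaging their normalized vectors, which is a stand-in for whatever fusion the model actually learns. All function names here are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(vision: Sequence[float], audio: Sequence[float],
         subtitle: Sequence[float]) -> List[float]:
    """Average the normalized modality embeddings (an illustrative assumption)."""
    parts = [l2_normalize(p) for p in (vision, audio, subtitle)]
    mean = [sum(vals) / len(parts) for vals in zip(*parts)]
    return l2_normalize(mean)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def recall_at_k(text_embs: List[Sequence[float]],
                clip_embs: List[Sequence[float]], k: int) -> float:
    """Fraction of text queries whose matching clip (same index) ranks in the top k."""
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, clip) for clip in clip_embs]
        ranked = sorted(range(len(clip_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy data: two clips, each with three modality embeddings, and two text prompts.
clips = [
    fuse([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
    fuse([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]),
]
texts = [[1.0, 0.1], [0.1, 1.0]]
r_at_1 = recall_at_k(texts, clips, k=1)
```

The convention that text prompt `i` matches clip `i` is also an assumption of this sketch; a real benchmark would supply its own ground-truth pairing.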
Strengths
- Integrates four distinct modalities: vision, audio, subtitles, and text
- Presented at the NeurIPS 2023 conference
- Keeps video frames temporally aligned with their audio tracks
- Treats subtitles as a separate modality, adding linguistic context beyond the visual and audio signals