This dataset is a restructured subset of AVSpeech that provides separate video and audio streams for each clip. It was created by ProgramComputer and was last updated on February 20, 2026. Each clip has a unique identifier derived from the original YouTube ID and the clip timestamps.
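The exact identifier format is not documented here, so the following is only a sketch of how an identifier could be built from a YouTube ID and clip timestamps and then parsed back. The `ClipRef` type, the function names, the underscore-separated layout, and the six-decimal timestamps are all assumptions for illustration; check real identifiers in the dataset before relying on any of them.

```python
from typing import NamedTuple


class ClipRef(NamedTuple):
    """Hypothetical container for the fields a clip identifier is derived from."""
    youtube_id: str
    start_s: float
    end_s: float


def make_clip_id(ref: ClipRef) -> str:
    # Assumed layout: "<youtube_id>_<start>_<end>", timestamps fixed to 6 decimals.
    return f"{ref.youtube_id}_{ref.start_s:.6f}_{ref.end_s:.6f}"


def parse_clip_id(clip_id: str) -> ClipRef:
    # YouTube IDs may themselves contain "_" or "-", so split from the right
    # and take only the last two fields as timestamps.
    youtube_id, start, end = clip_id.rsplit("_", 2)
    return ClipRef(youtube_id, float(start), float(end))
```

Splitting from the right is the one design choice worth keeping even if the real layout differs, since a naive left-to-right split would break on IDs that contain the separator.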
Use Cases
- Train audio-visual speech recognition models on the separated video and audio streams.
- Develop lip-syncing or visual speech generation models using the video-only stream.
- Conduct research on audio-visual correspondence using the synchronized but separate media tracks.
- Benchmark multimodal alignment algorithms using the derived clip identifiers and original metadata.
Strengths
- Provides pre-separated video and audio streams, which likely simplifies data loading for multimodal tasks.
- Includes original AVSpeech metadata fields such as YouTube ID and clip timestamps for traceability.
- Media streams are described as being copied without re-encoding, which may preserve original quality.
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Row count is not reported, so the dataset's size and suitability cannot be assessed before download.
- Column-level documentation is absent; field semantics must be inferred after download.
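A first pass at the post-download inspection mentioned above could check which streams and codecs each media file actually contains. The sketch below assumes `ffprobe` (part of FFmpeg) is installed and on the PATH. The function names are hypothetical, and nothing here reflects how the dataset author validated the files.

```python
import json
import subprocess
from typing import List, Tuple


def probe_command(path: str) -> List[str]:
    # Ask ffprobe for each stream's type and codec, as JSON, suppressing banner noise.
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "stream=codec_type,codec_name",
        "-of", "json",
        path,
    ]


def summarize_streams(probe_json: str) -> List[Tuple[str, str]]:
    # Reduce ffprobe's JSON output to (codec_type, codec_name) pairs.
    data = json.loads(probe_json)
    return [
        (s.get("codec_type", "unknown"), s.get("codec_name", "unknown"))
        for s in data.get("streams", [])
    ]


def probe(path: str) -> List[Tuple[str, str]]:
    # Runs ffprobe; raises CalledProcessError if the file cannot be read.
    result = subprocess.run(
        probe_command(path), capture_output=True, text=True, check=True
    )
    return summarize_streams(result.stdout)
```

If the streams are separated as described, a video file should report only a video stream and an audio file only an audio stream; anything else is worth a closer look.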
Provenance
- Source
  - Derived from the original AVSpeech dataset, which sourced clips from YouTube.
- Collection Method
  - Media streams were separated and copied without re-encoding from the original source videos.
- Freshness
  - Last updated 2026-02-20 22:21:03 (time zone not stated); check the dataset page for later revisions.
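For readers unfamiliar with separating streams without re-encoding, the sketch below shows one common way to do it with FFmpeg's stream-copy mode. It is not the author's pipeline, which is not documented here. The function names and file paths are hypothetical, and `ffmpeg` is assumed to be installed and on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def demux_commands(src: Path, video_out: Path, audio_out: Path) -> List[List[str]]:
    # "-c:v copy" / "-c:a copy" copy the encoded stream as-is (no re-encoding).
    # "-an" drops audio from the video output; "-vn" drops video from the audio output.
    video_cmd = ["ffmpeg", "-y", "-i", str(src), "-an", "-c:v", "copy", str(video_out)]
    audio_cmd = ["ffmpeg", "-y", "-i", str(src), "-vn", "-c:a", "copy", str(audio_out)]
    return [video_cmd, audio_cmd]


def demux(src: Path, video_out: Path, audio_out: Path) -> None:
    # Runs both commands; raises CalledProcessError if ffmpeg fails.
    for cmd in demux_commands(src, video_out, audio_out):
        subprocess.run(cmd, check=True)
```

Because the streams are copied rather than transcoded, the output container has to support the source codec (for example, AAC audio fits in an `.m4a` file), so the choice of output extension matters.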