VoxLingua107 is a speech dataset for training spoken language identification models. It contains 6628 hours of short speech segments automatically extracted from YouTube videos and labeled for 107 languages. The dataset was created by TalTechNLP and was last updated on September 4, 2025.
Use Cases
- Train spoken language identification models on the labeled speech segments
- Benchmark audio classification algorithms on multilingual data
- Develop multilingual speech processing tools that draw on the 107-language coverage
Strengths
- Contains data for 107 distinct languages
- Total training set size is 6628 hours of speech
- Average amount of data per language is about 62 hours
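The per-language average follows directly from the two figures above. The snippet below is a quick arithmetic check, not a statement about how the hours are actually distributed: it assumes the total is divided over exactly the 107 labeled languages, and individual languages may hold considerably more or less than the mean.

```python
# Figures taken from the dataset description above.
TOTAL_HOURS = 6628
NUM_LANGUAGES = 107

# Mean hours of speech per language; this is only an average,
# the per-language amounts are not given here and may be uneven.
average_hours = TOTAL_HOURS / NUM_LANGUAGES
print(f"{average_hours:.1f} hours per language on average")
```

Rounded to the nearest hour, this gives the 62 hours quoted above.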
Limitations
- Data is automatically extracted from YouTube, which may introduce source-specific biases
- Column-level documentation is absent; field semantics must be inferred after download
- Row count is not reported, which makes it harder to assess suitability before download
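Because the row count and field semantics are not documented, one practical first step after download is to count the files per label. The following is a minimal sketch under stated assumptions: it supposes a local copy laid out as one subdirectory per language label containing `.wav` files, which is not confirmed by this description. The function name `count_segments` and its parameters are hypothetical and chosen for illustration only.

```python
from collections import Counter
from pathlib import Path


def count_segments(root: str, pattern: str = "*.wav") -> Counter:
    """Count audio files under each immediate subdirectory of `root`.

    Assumption (not confirmed by the dataset description): each
    subdirectory name is a language label and each matching file
    is one speech segment.
    """
    counts = Counter()
    for lang_dir in sorted(Path(root).iterdir()):
        if lang_dir.is_dir():
            # rglob also matches files in nested folders below the label directory.
            counts[lang_dir.name] = sum(1 for _ in lang_dir.rglob(pattern))
    return counts
```

Calling `count_segments` on the root of a local copy returns a `Counter` keyed by subdirectory name; summing its values gives a total file count that can stand in for the missing row count. If the downloaded copy uses a different layout or audio format, adjust `pattern` or the directory traversal accordingly.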
Provenance
- Source: TalTechNLP
- Collection Method: Automatically extracted from YouTube videos, with post-processing to filter false positives
- Freshness: Last updated 2025-09-04 07:23:22