VoxLingua107: Multilingual Speech Segments for Language Identification | DataSalon

Speech & Audio

Name: VoxLingua107: Multilingual Speech Segments for Language Identification
Creator: TalTechNLP
Published: 2025-08-27T06:47:20
Keywords: Language Detection, Audio Classification, Speech Identification, Multilingual, Audio

Available on 1 platform

Description

VoxLingua107 is a speech dataset for training spoken language identification models. It contains 6628 hours of short speech segments automatically extracted from YouTube videos and labeled for 107 languages. The dataset was created by TalTechNLP and was last updated on September 4, 2025.

Use Cases

Train spoken language identification models on the labeled speech segments
Benchmark audio classification algorithms on multilingual data
Develop speech processing tools for multilingual applications, drawing on the 107-language coverage

Strengths

Contains data for 107 distinct languages
Total training set size is 6628 hours of speech
Average amount of data per language is 62 hours
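
The per-language figure above follows from the two totals in this listing. A minimal sketch of that arithmetic, assuming the 62-hour figure is a simple mean of total hours over the number of languages (the function and constant names are illustrative only):

```python
# Figures taken from this listing.
TOTAL_HOURS = 6628
NUM_LANGUAGES = 107


def mean_hours_per_language(total_hours: float, num_languages: int) -> float:
    """Return the simple mean of hours per language."""
    return total_hours / num_languages


avg = mean_hours_per_language(TOTAL_HOURS, NUM_LANGUAGES)
print(f"{avg:.1f} hours per language on average")
```

A mean says nothing about how evenly the hours are spread, so per-language totals are still worth checking before relying on coverage for any one language.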

Limitations

Data is automatically extracted from YouTube, which may introduce source-specific biases
Column-level documentation is absent; field semantics must be inferred after download
Row count is not reported, which makes it harder to assess suitability before download
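
Since neither column documentation nor a row count is published, both have to be recovered from the downloaded data. The sketch below shows one way to do that for records already loaded as Python dictionaries. The field names in the sample (`audio_path`, `label`, `duration_s`) are hypothetical placeholders, not the dataset's actual columns, and `summarize_fields` is an illustrative helper rather than part of any library.

```python
from collections import Counter, defaultdict


def summarize_fields(records):
    """Count rows and tally the Python type of each field's values.

    `records` is any iterable of dicts, e.g. rows parsed from a metadata file.
    Returns (row_count, {field_name: {type_name: occurrences}}).
    """
    type_counts = defaultdict(Counter)
    row_count = 0
    for record in records:
        row_count += 1
        for key, value in record.items():
            type_counts[key][type(value).__name__] += 1
    return row_count, {key: dict(counts) for key, counts in type_counts.items()}


# Hypothetical sample rows; the real field names must be read from the download.
sample = [
    {"audio_path": "a.wav", "label": "et", "duration_s": 12.5},
    {"audio_path": "b.wav", "label": "en", "duration_s": 7.0},
    {"audio_path": "c.wav", "label": None, "duration_s": 3.2},
]

rows, fields = summarize_fields(sample)
print(rows)
for name, types in fields.items():
    print(name, types)
```

A field with more than one observed type, such as a label that is sometimes missing, is a prompt to inspect those rows before training.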

Provenance

Source: TalTechNLP
Collection Method: Automatically extracted from YouTube videos, with post-processing to filter false positives
Freshness: Last updated 2025-09-04 07:23:22

Quality Score

C (44)

Components: Description, Source, Reputation

Community

16.5K downloads

2 likes

0 views

Dataset Info

Author: TalTechNLP
Created: Aug 27, 2025
Updated: Sep 4, 2025
Last synced: Jun 7, 2026
