Speech & Audio

VoxLingua107: 107-Language Speech Dataset for Language ID

Name: VoxLingua107: 107-Language Speech Dataset for Language ID
Creator: SEACrowd
Published: 2024-06-24T12:26:40
Keywords: Youtube Sourced, Language Detection, Audio Classification, Speech Identification, Multilingual, Audio, Multilingual Audio

Available on 1 platform

Description

VoxLingua107 is a speech dataset for training spoken language identification models. It contains 6628 hours of short speech segments sourced from YouTube videos, covering 107 languages. The dataset was created by SEACrowd and was last updated in June 2024.

Use Cases

Train language identification models based on short speech segments.
Benchmark audio classification algorithms on a multilingual dataset.
Develop speech recognition systems for underrepresented languages, drawing on the dataset's 107-language coverage.
Validate language detection accuracy using the volunteer-verified development set.
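
The validation use case above can be sketched as a per-language accuracy computation. This is a minimal sketch, assuming reference labels and model predictions are available as parallel lists of language codes. The function name `per_language_accuracy` and the sample labels are illustrative and are not part of the dataset.

```python
from collections import defaultdict


def per_language_accuracy(references, predictions):
    """Return {language: fraction of its segments predicted correctly}."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for ref, pred in zip(references, predictions):
        total[ref] += 1
        if ref == pred:
            correct[ref] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Hypothetical labels for five development segments (not real dataset rows).
refs = ["et", "et", "id", "id", "th"]
preds = ["et", "fi", "id", "id", "th"]

scores = per_language_accuracy(refs, preds)

# Macro average: every language counts equally, regardless of segment count.
macro_accuracy = sum(scores.values()) / len(scores)
```

Averaging per language rather than per segment is one reasonable choice here, since the amount of data per language is uneven and a per-segment average would be dominated by the largest languages.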

Strengths

Covers 107 languages, providing broad linguistic diversity.
Contains 6628 total hours of speech data.
Includes a separate development set of 1609 segments validated by at least two volunteers.

Limitations

The amount of data per language varies significantly around an average of 62 hours.
Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
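
The 62-hour average follows from the totals stated in this listing (6628 hours across 107 languages). The sketch below reproduces that arithmetic and shows one possible way to quantify imbalance once per-language totals are known. The helper `imbalance_ratio` and the example figures are hypothetical; the listing does not publish per-language hours.

```python
TOTAL_HOURS = 6628    # total hours, as stated in the listing
NUM_LANGUAGES = 107   # language count, as stated in the listing

# About 61.9 hours per language, which rounds to the quoted 62.
mean_hours = TOTAL_HOURS / NUM_LANGUAGES


def imbalance_ratio(hours_by_language):
    """Ratio of the largest to the smallest per-language total."""
    values = list(hours_by_language.values())
    if not values or min(values) <= 0:
        raise ValueError("need at least one language with positive hours")
    return max(values) / min(values)


# Made-up placeholder figures, only to exercise the helper.
example_hours = {"aa": 10.0, "bb": 62.0, "cc": 120.0}
ratio = imbalance_ratio(example_hours)
```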

Provenance

Source: YouTube videos.
Collection Method: Speech segments were extracted from YouTube videos and labeled with the language indicated by each video's title and description.
Freshness: Last updated 2024-06-24 13:32:12; freshness should be verified.
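
One way to act on the freshness note is to compute the age of the last update against a reference date. This is a minimal sketch: the timestamp comes from the listing, while the reference date argument, the 365-day threshold, and the function names are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

LAST_UPDATED = "2024-06-24 13:32:12"  # timestamp as shown in the listing


def age_at(reference, last_updated=LAST_UPDATED):
    """Return the time elapsed between the last update and a reference date."""
    updated = datetime.strptime(last_updated, "%Y-%m-%d %H:%M:%S")
    return reference - updated


def is_stale(reference, max_age=timedelta(days=365)):
    """True if the last update is older than max_age (assumed threshold)."""
    return age_at(reference) > max_age


# Example reference date; in practice this would be datetime.now().
reference_date = datetime(2026, 6, 13)
age = age_at(reference_date)
stale = is_stale(reference_date)
```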

License is unknown; terms of use must be verified before application.

Quality Score

D37

Community

11 downloads

1 like

0 views

Dataset Info

Author: SEACrowd
Created: Jun 24, 2024
Updated: Jun 24, 2024
Last synced: Jun 13, 2026
