VoxCeleb2 Language-Detected Subset: Speaker Audio with Language Labels

Name: VoxCeleb2 Language-Detected Subset: Speaker Audio with Language Labels
Creator: johbac
Published: 2025-04-05T19:28:39
Keywords: Audio Metadata, Language Detection, Speaker Identification, Tabular, Audio, Speech Recognition

Description

A language-labeled version of the VoxCeleb2 speaker identification dataset, created by applying the speechbrain/lang-id-voxlingua107-ecapa language identification model to the original audio clips. The dataset was authored by johbac and last updated on Hugging Face in April 2025.

Use Cases

Training language identification models based on the added language labels.
Filtering or analyzing the VoxCeleb2 dataset by language for targeted research.
Benchmarking speaker recognition systems on language-specific subsets.
Studying the intersection of speaker identity and spoken language characteristics.
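
The filtering and per-language analysis use cases above can be sketched with the Python standard library alone. The column names clip_id, speaker_id, and language, and the sample rows, are hypothetical: the CSV schema is not documented, so they should be replaced with the actual header after download.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the dataset's CSV. The real column
# names are undocumented and must be checked after download.
SAMPLE_CSV = """clip_id,speaker_id,language
c001,id00012,en
c002,id00012,en
c003,id00345,fr
c004,id00678,de
"""


def filter_by_language(csv_file, language, language_column="language"):
    """Return the rows whose language column equals the given label."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row[language_column] == language]


def language_counts(csv_file, language_column="language"):
    """Count how many clips carry each language label."""
    reader = csv.DictReader(csv_file)
    return Counter(row[language_column] for row in reader)


english_rows = filter_by_language(io.StringIO(SAMPLE_CSV), "en")
counts = language_counts(io.StringIO(SAMPLE_CSV))
```

With the real file, the io.StringIO object would be replaced by open(path, newline="", encoding="utf-8"), and language_column by whichever header actually holds the label.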

Strengths

Derived from the established VoxCeleb2 speaker identification dataset.
Language labels were added using a specific, named model (speechbrain/lang-id-voxlingua107-ecapa).
Metadata is structured in a CSV file with unique clip and speaker identifiers.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
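
Because the column names and row count are not documented, a first inspection pass after download might look like the sketch below. The sample content is hypothetical and only stands in for the real CSV. The helper name summarize_csv is an illustration, not part of the dataset.

```python
import csv
import io


def summarize_csv(csv_file, preview_rows=3):
    """Read the header, count data rows, and keep the first few rows."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # empty list if the file has no header line
    preview = []
    row_count = 0
    for row in reader:
        if row_count < preview_rows:
            preview.append(row)
        row_count += 1
    return {"columns": header, "row_count": row_count, "preview": preview}


# Hypothetical content; the actual columns must be read from the real file.
sample = io.StringIO(
    "clip_id,speaker_id,language\n"
    "c001,id00012,en\n"
    "c002,id00345,fr\n"
)
summary = summarize_csv(sample)
```

The function streams the file row by row, so it should also work on a large metadata file without loading it into memory.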

Provenance

Source: Derived from the ProgramComputer/voxceleb Hugging Face dataset (original VoxCeleb2).
Collection Method: Created by processing the original dataset with the speechbrain/lang-id-voxlingua107-ecapa language identification model.
Freshness: Last updated 2025-04-05 20:02:06; check the source for later revisions before relying on this date.
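
The collection method above can be illustrated roughly as follows. The source does not describe how the author actually ran the model, so this is only a sketch: it assumes SpeechBrain 1.0 or later is installed, and the helper names, the savedir value, and the clip paths are hypothetical. The labeling loop takes the classifier as a parameter so it can be tried with a stub before downloading the model.

```python
def make_speechbrain_classifier(savedir="tmp_lang_id"):
    """Build a classify(path) callable backed by the named SpeechBrain model.

    Assumes speechbrain >= 1.0; calling this downloads the model weights.
    """
    from speechbrain.inference.classifiers import EncoderClassifier

    model = EncoderClassifier.from_hparams(
        source="speechbrain/lang-id-voxlingua107-ecapa", savedir=savedir
    )

    def classify(path):
        signal = model.load_audio(path)
        # classify_batch returns (posteriors, best score, best index, text labels)
        _, _, _, text_labels = model.classify_batch(signal)
        return text_labels[0]

    return classify


def label_clips(clip_paths, classify):
    """Attach a predicted language label to each clip path."""
    return [{"clip_path": path, "language": classify(path)} for path in clip_paths]


# Stub classifier with hypothetical paths, so the loop runs without the model.
def stub_classify(path):
    return "en" if "english" in path else "fr"


labeled = label_clips(["english_clip.wav", "french_clip.wav"], stub_classify)
```

To label real audio, stub_classify would be swapped for the callable returned by make_speechbrain_classifier().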

Quality Score

Overall: D36
Description: 39
Source: 41
Reputation: 27
Access: 26

Community

Downloads: 12
Likes: 2
Views: 0

Dataset Info

Author: johbac
Created: Apr 5, 2025
Updated: Apr 5, 2025
Last synced: Jun 8, 2026
