A language-labeled version of the VoxCeleb2 speaker identification dataset. It was created by applying a language identification model to the original audio clips. The dataset was authored by johbac and last updated on Hugging Face in April 2025.
Use Cases
- Training language identification models based on the added language labels.
- Filtering or analyzing the VoxCeleb2 dataset by language for targeted research.
- Benchmarking speaker recognition systems on language-specific subsets.
- Studying the intersection of speaker identity and spoken language characteristics.
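The filtering and per-language benchmarking use cases above can be sketched with the standard library alone. This is a minimal illustration, not the dataset's documented schema: the column names `clip_id`, `speaker_id`, and `language`, and the toy rows, are all hypothetical and must be replaced with whatever the real CSV contains.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for the metadata CSV. Column names and rows are hypothetical;
# check the real file's header before reusing this.
TOY_CSV = """clip_id,speaker_id,language
c1,id00001,en
c2,id00001,en
c3,id00002,fr
c4,id00003,en
"""


def load_rows(handle):
    """Read every CSV row into a dict keyed by column name."""
    return list(csv.DictReader(handle))


def filter_by_language(rows, lang):
    """Keep only the clips whose language label equals `lang`."""
    return [row for row in rows if row["language"] == lang]


def speakers_per_language(rows):
    """Count distinct speakers for each language label."""
    speakers = defaultdict(set)
    for row in rows:
        speakers[row["language"]].add(row["speaker_id"])
    return {lang: len(ids) for lang, ids in speakers.items()}


rows = load_rows(io.StringIO(TOY_CSV))
english = filter_by_language(rows, "en")
counts = speakers_per_language(rows)
print([row["clip_id"] for row in english])
print(counts)
```

Counting distinct speakers rather than clips is a deliberate choice here: a language subset with many clips but few speakers is of limited use for speaker recognition benchmarks.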
Strengths
- Derived from the established VoxCeleb2 speaker identification dataset.
- Language labels come from a named model (speechbrain/lang-id-voxlingua107-ecapa), so the labeling step can be traced and, in principle, rerun.
- Metadata is structured in a CSV file with unique clip and speaker identifiers.
Limitations
- The dataset description is sparse, so data quality has to be checked manually after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not stated, so the dataset's size and per-language coverage cannot be judged before download.
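Because row count and column semantics are undocumented, a first pass after download might look like the sketch below. It reports the header, the number of rows, repeated identifiers, and empty cells per column. The identifier column name `clip_id` and the toy rows are assumptions for illustration only.

```python
import csv
import io
from collections import Counter


def summarize(handle, id_column):
    """Report header, row count, repeated ids, and empty cells per column."""
    reader = csv.DictReader(handle)
    id_counts = Counter()
    empty_cells = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        id_counts[row[id_column]] += 1
        for name in reader.fieldnames:
            value = row.get(name)
            # A short row yields None; a blank field yields an empty string.
            if value is None or not value.strip():
                empty_cells[name] += 1
    return {
        "columns": list(reader.fieldnames or []),
        "row_count": row_count,
        "duplicate_ids": sorted(k for k, n in id_counts.items() if n > 1),
        "empty_cells": dict(empty_cells),
    }


# Hypothetical sample with one repeated id and one missing language label.
SAMPLE = """clip_id,speaker_id,language
c1,id00001,en
c2,id00002,
c2,id00003,fr
"""

report = summarize(io.StringIO(SAMPLE), id_column="clip_id")
print(report)
```

For the real file, pass an open file handle instead of the in-memory sample and set `id_column` to the actual identifier column.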
Provenance
- Source: Derived from the ProgramComputer/voxceleb Hugging Face dataset (original VoxCeleb2).
- Collection Method: Created by processing the original dataset with the speechbrain/lang-id-voxlingua107-ecapa language identification model.
- Freshness: Last updated 2025-04-05 20:02:06; check the Hugging Face repository for later revisions before relying on this date.
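The collection method can be approximated with SpeechBrain's `EncoderClassifier`, following the usage shown on the model's own card. This is a sketch under assumptions, not the author's actual pipeline: the source names the model but not the code, the audio preprocessing, or any confidence threshold. The helper names, the `savedir` value, and the `"code: Name"` label format handled by `iso_code` are illustrative.

```python
def iso_code(label: str) -> str:
    """Extract the language code from a label such as 'th: Thai'."""
    return label.split(":", 1)[0].strip()


def load_classifier(savedir: str = "pretrained_lang_id"):
    """Download (or reuse) the pretrained language identification model."""
    # Imported lazily so iso_code works without SpeechBrain installed.
    # Older SpeechBrain releases expose this class under speechbrain.pretrained.
    from speechbrain.inference.classifiers import EncoderClassifier

    return EncoderClassifier.from_hparams(
        source="speechbrain/lang-id-voxlingua107-ecapa",
        savedir=savedir,
    )


def predict_language(classifier, wav_path: str) -> str:
    """Return the predicted language code for one audio file."""
    signal = classifier.load_audio(wav_path)
    # classify_batch returns (scores, best score, best index, text labels).
    _, _, _, labels = classifier.classify_batch(signal)
    return iso_code(labels[0])
```

A typical call would be `predict_language(load_classifier(), "clip.wav")`, where `clip.wav` is a placeholder path. Loading the classifier once and reusing it across clips avoids repeating the model download and initialization.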