Name: UrduSpeech: 156-Hour Urdu Speech Corpus with Paralinguistic Metadata
Creator: ASLP-lab
Published: 2026-05-25T14:49:56
Keywords: Audio Diarization, Code Switching, Speech Corpus, Urdu Language, Paralinguistics, Audio, Large Scale, Natural Language Processing

Description

UrduSpeech provides 156 hours of high-fidelity Urdu audio, addressing the under-resourcing of Urdu in speech technology. The corpus contains 71,792 diarized utterances across three specialized subsets: Standard Pakistani Urdu, Urdu-English Code-Switched, and Pakistani-Accented English. It was created by ASLP-lab and last updated in June 2026.

Use Cases

Train automatic speech recognition models on the large-scale, diarized Urdu audio.
Develop code-switching detection algorithms using the dedicated Urdu-English subset.
Study paralinguistic features such as emotion or speaker traits using the 12-dimensional metadata.
Build text-to-speech systems for Pakistani-accented English using the US-EngPk subset.
Benchmark speaker diarization performance on the 71,792 segmented utterances.

Strengths

156 hours of total audio provides substantial training material.
71,792 diarized utterances offer fine-grained segmentation.
12-dimensional paralinguistic metadata enables multifaceted analysis.
Three distinct subsets (Standard, Code-Switched, English) support comparative studies.
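
As a rough sanity check on the segmentation granularity, the two headline figures above (156 hours, 71,792 utterances) imply an average utterance length. The sketch below is only this arithmetic; it assumes both totals are exact and says nothing about how durations are actually distributed across the three subsets.

```python
# Figures taken from the dataset description; treated as exact for this estimate.
TOTAL_HOURS = 156
NUM_UTTERANCES = 71_792


def mean_utterance_seconds(total_hours: float, num_utterances: int) -> float:
    """Return the average utterance length in seconds."""
    if num_utterances <= 0:
        raise ValueError("num_utterances must be positive")
    return total_hours * 3600 / num_utterances


avg = mean_utterance_seconds(TOTAL_HOURS, NUM_UTTERANCES)
print(f"{avg:.2f} s per utterance on average")  # prints "7.82 s per utterance on average"
```

An average near 8 seconds is consistent with utterance-level rather than session-level segmentation, though the real per-utterance durations should be checked once the data is downloaded.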

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which may limit suitability assessment; whether it matches the 71,792 utterances is unconfirmed.
Freshness should be verified, as the last-update timestamp (2026-06-04) is dated in the future.
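
Because column-level documentation is absent, a first pass over the downloaded metadata usually means profiling each field. The sketch below is one minimal way to do that, assuming the rows have already been loaded as dictionaries. The function name `summarize_fields` and the sample field names are hypothetical illustrations, not the corpus's actual schema.

```python
from collections import defaultdict
from typing import Any, Iterable, Mapping


def summarize_fields(rows: Iterable[Mapping[str, Any]], max_examples: int = 3) -> dict:
    """Collect observed value types, None counts, and a few example values per field.

    Fields that are simply absent from a row are not counted as missing.
    """
    summary = defaultdict(lambda: {"types": set(), "missing": 0, "examples": []})
    for row in rows:
        for key, value in row.items():
            info = summary[key]
            if value is None:
                info["missing"] += 1
                continue
            info["types"].add(type(value).__name__)
            if len(info["examples"]) < max_examples and value not in info["examples"]:
                info["examples"].append(value)
    return dict(summary)


# Made-up rows for illustration only; the real field names must be read from the download.
sample_rows = [
    {"speaker_id": "spk_001", "duration": 4.2, "emotion": None},
    {"speaker_id": "spk_002", "duration": 9.7, "emotion": "neutral"},
]

for field, info in summarize_fields(sample_rows).items():
    print(field, sorted(info["types"]), info["missing"], info["examples"])
```

Running the same profile separately on each of the three subsets would show whether the 12 metadata dimensions are populated consistently.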

Provenance

Source: ASLP-lab
Freshness: Last updated 2026-06-04 12:43:17.
Geography: Likely Pakistan, given the focus on Standard Pakistani Urdu and Pakistani-accented English.

License information is unknown and should be confirmed before use.
