This Bahnar speech translation dataset contains audio aligned with Bahnar, Vietnamese, and English text. It was created from internet data sources and automatically aligned using the Bahnar-Vietnamese-S2TT pipeline. The training split contains 113,830 utterances.
Use Cases
- Train models for direct speech-to-text translation from Bahnar speech to Vietnamese text, using the aligned audio and text.
- Develop multilingual speech recognition systems, using the aligned Bahnar, Vietnamese, and English text.
- Research low-resource language processing techniques, with Bahnar as the low-resource language.
- Benchmark automatic alignment pipelines for speech data, given that the dataset itself was produced by automatic alignment.
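As a rough illustration of the first use case, the sketch below turns a list of records into (audio, target text) pairs for speech-to-text translation training. It is a minimal sketch under assumptions: the dataset's real column names are not documented, so the function takes the audio and target-text field names as arguments, and the names `build_translation_pairs`, `clip`, and `target_text` are hypothetical placeholders rather than actual fields of this dataset.

```python
def build_translation_pairs(rows, audio_key, target_key):
    """Collect (audio, target_text) pairs, skipping incomplete records.

    `audio_key` and `target_key` are supplied by the caller because the
    dataset's real column names are undocumented.
    """
    pairs = []
    for row in rows:
        audio = row.get(audio_key)
        target = row.get(target_key)
        # Skip records with no audio reference or no usable target text.
        if audio is None or not isinstance(target, str) or not target.strip():
            continue
        pairs.append((audio, target.strip()))
    return pairs


# Hypothetical records; field names and values are placeholders only.
example_rows = [
    {"clip": "0001.wav", "target_text": " xin chào "},
    {"clip": "0002.wav", "target_text": ""},
    {"clip": None, "target_text": "cảm ơn"},
    {"clip": "0004.wav"},
]

for audio, text in build_translation_pairs(example_rows, "clip", "target_text"):
    print(audio, "->", text)
```

Passing the field names in explicitly keeps the sketch usable once the actual schema has been inspected, without baking in guesses about it.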
Strengths
- Training split contains 113,830 utterances, providing a substantial base for model training.
- Provides aligned text in three languages (Bahnar, Vietnamese, English), enabling multi-task learning.
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
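Because the fields are undocumented, a first pass after download might simply tabulate which fields exist, what value types they hold, and how often they are empty. The following is a minimal sketch of that idea, assuming the records have already been loaded as a list of dictionaries; `summarize_columns` and the sample field names (`clip`, `field_a`, `duration`) are hypothetical and do not reflect the dataset's actual schema.

```python
from collections import Counter


def summarize_columns(rows):
    """Report observed value types, missing counts, and one example per field."""
    # Collect every field name seen in any record.
    fields = sorted({name for row in rows for name in row})
    report = {
        name: {"types": Counter(), "missing": 0, "example": None}
        for name in fields
    }
    for row in rows:
        for name in fields:
            value = row.get(name)
            # Treat absent keys, None, and blank strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                report[name]["missing"] += 1
                continue
            report[name]["types"][type(value).__name__] += 1
            if report[name]["example"] is None:
                report[name]["example"] = value
    return report


# Hypothetical sample records; the real column names are not documented.
sample_rows = [
    {"clip": "0001.wav", "field_a": "hello", "duration": 2.5},
    {"clip": "0002.wav", "field_a": "", "duration": 3.1},
    {"clip": "0003.wav", "duration": 1.0},
]

for name, info in summarize_columns(sample_rows).items():
    print(name, dict(info["types"]), "missing:", info["missing"], "example:", info["example"])
```

A summary like this does not replace manual inspection, but it can show quickly which fields are likely to hold audio references and which hold text.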
Provenance
- Source: cuong06
- Collection Method: Created from internet data sources and automatically aligned using the Bahnar-Vietnamese-S2TT pipeline.
- Freshness: Last updated 2026-05-28 08:22:32; freshness should be verified.