A multi-source collection of German speech audio paired with transcriptions and English translations, curated by aman4014. The dataset is designed for training and evaluating Automatic Speech Recognition, Speech Translation, and Text-to-Speech systems. It was last updated on March 30, 2026.
Use Cases
- Training German Automatic Speech Recognition models on paired German audio and transcriptions.
- Developing Speech Translation systems from paired German audio and English translations.
- Building multilingual Text-to-Speech systems on the unified speech corpus.
- Evaluating model performance on a curated mixture of established open-source speech corpora.
Strengths
- The dataset is described as large-scale and multi-source.
- It contains paired German audio, German transcriptions, and English translations.
- It is a curated mixture of well-established open-source German and multilingual speech corpora.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- The row count is not documented, which makes it hard to judge whether the dataset is large enough for a given task.
- Last updated 2026-03-30 10:49:21; freshness should be verified.
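Because the column documentation is missing, a first step after download is usually to look at a handful of records and note which fields appear and what types they hold. The following is a minimal sketch of that inspection, not a description of the dataset's actual schema: the helper `summarize_fields` and the field names `audio`, `transcription`, and `translation` are hypothetical and used for illustration only.

```python
from collections import defaultdict
from itertools import islice
from typing import Any, Dict, Iterable, Set


def summarize_fields(
    records: Iterable[Dict[str, Any]], sample_size: int = 100
) -> Dict[str, Set[str]]:
    """Map each field name to the set of Python type names seen in a sample."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    # islice stops after sample_size records, so a streamed source is not read in full.
    for record in islice(records, sample_size):
        for name, value in record.items():
            seen[name].add(type(value).__name__)
    return dict(seen)


# Hypothetical records standing in for the first rows of the real dataset.
sample = [
    {"audio": b"\x00\x01", "transcription": "Guten Morgen", "translation": "Good morning"},
    {"audio": b"\x02\x03", "transcription": "Danke", "translation": None},
]

summary = summarize_fields(sample)
for name in sorted(summary):
    print(name, sorted(summary[name]))
```

Any iterable of dictionaries can be passed in place of `sample`, for example the records yielded by the Hugging Face `datasets` library when a dataset is loaded with `streaming=True`. A field that reports more than one type, such as a mix of `str` and `NoneType`, suggests missing values worth checking before training.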
Provenance
- Source: A mixture of well-established open-source German and multilingual speech corpora.
- Collection Method: Curated and unified under a common schema.