A corrected version of the Mozilla Common Voice 17 Turkish corpus for speech recognition tasks. It uses filename stems as unique keys to reorganize the data and keep the splits consistent for model training.
Use Cases
- Train Turkish automatic speech recognition models using the audio recordings and text transcriptions
- Perform data deduplication and split validation using the filename stems as unique identifiers
- Fine-tune acoustic models on Turkish using the corrected dataset structure
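The deduplication and split-validation use case can be sketched as follows. This is a minimal illustration, not the procedure used to build the dataset: the helper names (`stem_key`, `dedupe_by_stem`, `split_overlaps`) and the example file paths are hypothetical, and the sketch assumes each split is available as a plain list of audio file paths.

```python
from pathlib import Path


def stem_key(path: str) -> str:
    """Return the filename stem, i.e. the file name without directory or extension."""
    return Path(path).stem


def dedupe_by_stem(paths):
    """Keep the first path seen for each stem, preserving input order."""
    seen = set()
    unique = []
    for p in paths:
        key = stem_key(p)
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


def split_overlaps(splits):
    """Return {(split_a, split_b): shared_stems} for every pair of splits that share a stem."""
    stems = {name: {stem_key(p) for p in paths} for name, paths in splits.items()}
    names = sorted(stems)
    overlaps = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = stems[a] & stems[b]
            if shared:
                overlaps[(a, b)] = shared
    return overlaps


if __name__ == "__main__":
    # Hypothetical file names, used only to show the shape of the check.
    splits = {
        "train": ["clips/sample_001.mp3", "clips/sample_002.mp3"],
        "test": ["clips/sample_002.mp3", "clips/sample_003.mp3"],
    }
    print(dedupe_by_stem(splits["train"] + splits["test"]))
    print(split_overlaps(splits))
```

An empty result from `split_overlaps` means no stem appears in more than one split, which is the property a leakage check would look for before training.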
Strengths
- Based on the Mozilla Common Voice 17 Turkish dataset
- Uses filename stems as unique keys to support data integrity and deduplication
- Reorganized data splits intended for speech recognition training