This dataset provides speech subsets for 22 Indian languages in Parquet format for use in the Hugging Face ecosystem. Each language has its own configuration, so audio data and transcriptions sourced from the AI4Bharat Nirantar project can be loaded one language at a time.
Use Cases
- Develop speech-to-text systems for Indian languages, for example by loading the 'train' split of the 'hi' (Hindi) configuration.
- Compare acoustic properties across 22 different Indian languages using the language-wise subsets.
- Build language detection models by training on the distinct language configurations provided in the dataset.
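As a rough illustration of the use cases above, the following sketch loads one language configuration with the Hugging Face `datasets` library and builds integer labels for language identification. The repository id is not stated in this description, so `repo_id` is a placeholder argument. The helper names `load_language_split` and `build_language_labels` are hypothetical, and streaming mode is an assumption chosen to avoid downloading a full subset.

```python
from typing import Any, Dict, Iterable


def load_language_split(
    repo_id: str,
    language: str = "hi",
    split: str = "train",
    streaming: bool = True,
) -> Any:
    """Load one language configuration of the dataset.

    `repo_id` is a placeholder: the dataset's Hub id is not given in this
    description. Only the 'hi' configuration is confirmed by it.
    """
    # Imported lazily so the label helper below works without `datasets`.
    from datasets import load_dataset

    # The second positional argument selects the configuration
    # (here, the language subset).
    return load_dataset(repo_id, language, split=split, streaming=streaming)


def build_language_labels(languages: Iterable[str]) -> Dict[str, int]:
    """Map each configuration name to an integer class id.

    Sorting makes the mapping independent of the input order.
    """
    unique_sorted = sorted(set(languages))
    return {code: index for index, code in enumerate(unique_sorted)}
```

With a real repository id, `load_language_split(repo_id, "hi")` would return the Hindi 'train' split. The column names for audio and transcription are not listed here, so inspect one example before building a training pipeline.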
Strengths
- Contains 22 language-specific configurations including 'hi' for Hindi.
- Uses Parquet files for efficient data handling and integration with the Hugging Face datasets library.
- Features a 'train' split for every language subset to facilitate model training.
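The claims about configurations and splits can be checked programmatically. The sketch below assumes the dataset is hosted on the Hugging Face Hub under some `repo_id` (not stated here) and uses `get_dataset_config_names` and `get_dataset_split_names` from the `datasets` library. The function names `collect_splits` and `configs_missing_split` are hypothetical.

```python
from typing import Dict, Iterable, List


def collect_splits(repo_id: str) -> Dict[str, List[str]]:
    """Return the split names available for every configuration.

    `repo_id` is a placeholder; this call needs network access to the Hub.
    """
    # Imported lazily so the pure helper below works without `datasets`.
    from datasets import get_dataset_config_names, get_dataset_split_names

    return {
        config: list(get_dataset_split_names(repo_id, config))
        for config in get_dataset_config_names(repo_id)
    }


def configs_missing_split(
    splits_by_config: Dict[str, Iterable[str]],
    required: str = "train",
) -> List[str]:
    """List the configurations that lack the required split, sorted by name."""
    return sorted(
        config
        for config, splits in splits_by_config.items()
        if required not in splits
    )
```

If the description above holds, `collect_splits(repo_id)` should return 22 entries and `configs_missing_split(...)` should return an empty list.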