Name: Open Large Bengali ASR Data: 5000 Hours of Speech Audio with Quality Filter
Creator: SKNahin
Published: 2024-03-23T18:52:45
Keywords: Bengali, Multilingual, Audio, Natural Language Processing, Speech Recognition

Description

A collection of 5000 hours of Bengali speech audio for automatic speech recognition (ASR), aggregated from nine public sources including Common Voice and OpenSLR. The dataset, created by SKNahin and last updated in March 2024, includes a filtering column that flags higher-quality audio segments based on word error rate (WER) and words-per-second (WPS) metrics.
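The dataset does not document how its quality flag was computed, so the following is only a minimal sketch of how a filter based on WER and WPS could work. The function names, the idea of comparing the label against a separate model transcript, and all thresholds (`max_wer`, `min_wps`, `max_wps`) are assumptions for illustration, not the creator's actual method.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Convention chosen here: empty reference is perfect only if hypothesis is empty too.
        return 0.0 if not hyp else 1.0
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def words_per_second(transcript: str, duration_seconds: float) -> float:
    """Number of whitespace-separated words per second of audio."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(transcript.split()) / duration_seconds


def looks_better(reference: str, hypothesis: str, duration_seconds: float,
                 max_wer: float = 0.5, min_wps: float = 0.5,
                 max_wps: float = 5.0) -> bool:
    """Hypothetical quality rule: low WER and a plausible speaking rate.

    The thresholds are placeholders, not values taken from the dataset.
    """
    wer = word_error_rate(reference, hypothesis)
    wps = words_per_second(reference, duration_seconds)
    return wer <= max_wer and min_wps <= wps <= max_wps
```

A segment whose label disagrees strongly with a model transcript, or whose speaking rate is implausibly low or high for its duration, would be rejected under this rule; whether the dataset's own column follows the same logic has to be checked against the data.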

Use Cases

Training Bengali speech-to-text models on a large volume of labeled audio.
Benchmarking ASR model performance across the data sources listed under Provenance.
Filtering training data for quality using the provided 'is_better' column based on WER and WPS.
Studying acoustic and linguistic diversity in Bengali speech from multiple public corpora.
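The filtering and per-source use cases above can be sketched as follows. Only the `is_better` column is named in the dataset description; the `source` and `duration` field names, the assumption that `is_better` is stored as a truthy boolean, and the sample rows are all hypothetical and should be checked against the downloaded files.

```python
from collections import defaultdict


def keep_better(rows):
    """Keep only rows whose 'is_better' flag is truthy (assumed boolean-like)."""
    return [row for row in rows if row.get("is_better")]


def hours_by_source(rows):
    """Sum audio duration per source, in hours.

    Assumes hypothetical 'source' (str) and 'duration' (seconds) fields.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row["source"]] += row["duration"] / 3600.0
    return dict(totals)


# Made-up rows for illustration only; not real dataset records.
sample_rows = [
    {"source": "commonvoice", "duration": 1800.0, "is_better": True},
    {"source": "openslr", "duration": 3600.0, "is_better": False},
    {"source": "commonvoice", "duration": 1800.0, "is_better": True},
]

kept = keep_better(sample_rows)
print(hours_by_source(kept))
```

Comparing `hours_by_source(sample_rows)` with `hours_by_source(kept)` gives a quick view of how much audio each source loses to the quality filter, which is useful before benchmarking models across sources.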

Strengths

Contains 5000 hours of Bengali audio, a substantial volume for model training.
Includes a quality-filtering mechanism ('is_better' column) based on objective metrics (WER and WPS).
Aggregates data from nine distinct public sources, potentially increasing diversity.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Specific license terms for the aggregated data are unknown, which may restrict usage.
Data may reflect geographic or demographic bias inherent to the source platforms.

Provenance

Source: Aggregated from nine public sources: commonvoice, openslr, madasr, shrutilipi, flerus, kathbath, indictts, ucla, gali.
Collection Method: Publicly available ASR data collected and filtered.
Time Range: Not specified.
Freshness: Last updated 2024-03-26 09:50:50; freshness should be verified.
Geography: Not specified.

License: Unknown; users must verify permissible use and attribution requirements for the aggregated sources.
