Description
A benchmark dataset created by SkunkWorkLabs, last updated in May 2026, for evaluating Hindi automatic speech recognition (ASR) systems. It compares the performance of the SkunkWorks model against commercial providers like ElevenLabs, Deepgram, and Sarvam. The evaluation is conducted across six distinct subsets sourced from projects like AI4Bharat Kathbath, Mozilla Common Voice, and Google FLEURS.
Use Cases
Benchmarking a Hindi ASR model against multiple commercial providers.
Evaluating model robustness in noisy conditions using the 'kathbath_noisy' subset.
Assessing model generalization across diverse data sources using the six distinct evaluation subsets.
Conducting comparative analysis of open-source versus commercial ASR systems for Hindi.
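The card does not say which metric the benchmark reports. Word error rate (WER) is a common choice for ASR comparisons, so the sketch below assumes it. The function names, the system labels ("system_a", "system_b"), and the sample transcripts are all hypothetical and are not taken from the dataset.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for one utterance: word edits divided by reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total word edits over total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses differ in length")
    edits = sum(edit_distance(r.split(), h.split())
                for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    if words == 0:
        raise ValueError("no reference words to score against")
    return edits / words


if __name__ == "__main__":
    # Illustrative transcripts only; not drawn from the benchmark.
    references = ["मैं घर जा रहा हूँ"]
    outputs: Dict[str, List[str]] = {
        "system_a": ["मैं घर जा रहा हूँ"],
        "system_b": ["मैं घर जा रहा"],
    }
    for name, hyps in outputs.items():
        print(name, round(corpus_wer(references, hyps), 3))
```

Running the same scoring function separately on each subset would show whether a system's ranking holds across sources or only on some of them. Real evaluations usually normalize the text first (punctuation, numerals, Unicode forms), which this sketch leaves out.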
Strengths
Evaluates performance across six distinct and named test subsets, including Kathbath, Common Voice, and MUCS.
Provides a direct comparison between a specific model (SkunkWorks) and three major commercial ASR providers.
Includes a subset specifically designed for noisy microphone conditions ('kathbath_noisy').
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown for most subsets, which makes it hard to judge whether the data is suitable for large-scale training.
Description metadata is limited; actual data quality and audio file formats require manual inspection.
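Since the columns are undocumented, a first pass after download might simply list each field with the value types seen, the number of missing values, and one example. The sketch below assumes the rows have already been loaded as a list of dictionaries. The field names in the sample ("audio_path", "reference", "subset") are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_columns(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Report observed types, missing count, and one example per field."""
    # Collect every key first so fields absent from some rows are counted.
    keys = set()
    for row in rows:
        keys.update(row)
    summary: Dict[str, Dict[str, Any]] = {
        key: {"types": set(), "missing": 0, "example": None} for key in keys
    }
    for row in rows:
        for key in keys:
            value = row.get(key)
            info = summary[key]
            if value is None:
                info["missing"] += 1
            else:
                info["types"].add(type(value).__name__)
                if info["example"] is None:
                    info["example"] = value
    return summary


def rows_per_value(rows: List[Dict[str, Any]], field: str) -> Counter:
    """Count rows for each value of one field, e.g. a subset label."""
    return Counter(row.get(field) for row in rows)


if __name__ == "__main__":
    # Hypothetical rows standing in for the downloaded data.
    sample = [
        {"audio_path": "clip_0001.wav", "reference": "नमस्ते", "subset": "kathbath_noisy"},
        {"audio_path": "clip_0002.wav", "reference": None, "subset": "kathbath_noisy"},
    ]
    for name, info in sorted(summarize_columns(sample).items()):
        print(name, sorted(info["types"]), info["missing"], info["example"])
    print(rows_per_value(sample, "subset"))
```

A per-subset count of this kind would also fill in the row counts that the card leaves unspecified.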
Provenance
Source
SkunkWorkLabs, aggregated from multiple sources including AI4Bharat, Mozilla, and Google.
Collection Method
Likely compiled from existing public speech datasets for benchmark creation.
Time Range
Not specified.
Freshness
Last updated 2026-05-04 16:46:18; freshness should be verified.
Geography
Primarily Hindi language data, likely focused on Indian contexts.
License
Unknown; terms of use for the aggregated sources must be verified.