Vaani Benchmark V1.0: Hindi Speech Recognition with 5,343 Audio Segments

Name: Vaani Benchmark V1.0: Hindi Speech Recognition with 5,343 Audio Segments
Creator: ARTPARK-IISc
Published: 2026-06-05T11:50:15
Keywords: Benchmark, Multilingual, Hindi, Audio, Audio Transcription, Speech Recognition

Description

ARTPARK-IISc's Vaani Benchmark V1.0 is a curated Hindi automatic speech recognition (ASR) evaluation set. It contains 5,343 audio segments from 1,103 speakers across 104 Indian districts, totaling approximately 11.7 hours. Each audio segment includes three independent human transcriptions.
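The headline figures above imply a few derived quantities that are useful when sizing an evaluation run. The following is a back-of-the-envelope sketch using only the numbers quoted in the description; the variable names are illustrative, and because the total duration is given as approximate, the averages are approximate too.

```python
# Figures quoted in the description (total duration is approximate).
segments = 5343
speakers = 1103
districts = 104
total_hours = 11.7
transcriptions_per_segment = 3

# Derived averages; these say nothing about the actual distribution.
avg_segment_seconds = total_hours * 3600 / segments
avg_segments_per_speaker = segments / speakers
avg_speakers_per_district = speakers / districts
total_transcriptions = segments * transcriptions_per_segment

print(f"Average segment length: {avg_segment_seconds:.1f} s")
print(f"Average segments per speaker: {avg_segments_per_speaker:.1f}")
print(f"Average speakers per district: {avg_speakers_per_district:.1f}")
print(f"Total human transcriptions: {total_transcriptions}")
```

On these figures, a segment averages roughly 7.9 seconds and the set carries 16,029 human transcriptions in total.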

Use Cases

Benchmarking the accuracy of Hindi ASR models against the 5,343 human-transcribed audio segments.
Evaluating model performance on code-switched speech, since the recordings are Hindi with code-switching.
Analyzing speaker and geographic diversity in ASR data across the 1,103 speakers and 104 districts.
Assessing transcription consistency and quality by comparing the three independent human transcriptions of each segment.
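For the benchmarking use case, one plausible way to score a model when each segment has three reference transcriptions is to compute word error rate (WER) against each reference and report both the best and the mean. The sketch below is a minimal, dependency-free illustration, not the scoring protocol defined by the dataset authors. The function names, the whitespace tokenization, and the example sentences are all assumptions; a real evaluation would likely also need text normalization for Hindi.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER of one hypothesis against one reference, split on whitespace."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def multi_reference_wer(references: List[str], hypothesis: str) -> Dict[str, float]:
    """Score against every reference; return the minimum and mean WER."""
    scores = [wer(ref, hypothesis) for ref in references]
    return {"min": min(scores), "mean": sum(scores) / len(scores)}


# Hypothetical segment: three human transcriptions and one model output.
references = [
    "मैं कल बाजार गया था",
    "मैं कल market गया था",
    "मैं कल बाजार गया था",
]
hypothesis = "मैं कल बाजार गया"
print(multi_reference_wer(references, hypothesis))
```

Reporting the minimum rewards a model for matching any one annotator, which can be more forgiving on code-switched words that annotators write differently; the mean is stricter. Which to report is a design choice that the source does not specify.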

Strengths

Contains 5,343 audio segments, providing a substantial evaluation corpus.
Features three independent human transcriptions per segment, allowing for reliability assessment.
Covers 1,103 speakers from 104 districts across 16 Indian states, suggesting geographic and speaker diversity.
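The reliability assessment that three transcriptions per segment make possible could be approximated by averaging a similarity score over every pair of annotators. The sketch below uses `difflib.SequenceMatcher` from the Python standard library on whitespace-split tokens. The function name and the choice of similarity measure are assumptions made for illustration; the source does not say how agreement should be measured.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def pairwise_agreement(transcripts: List[str]) -> float:
    """Mean token-level similarity over all pairs of transcriptions.

    Returns a value in [0, 1]; 1.0 means all transcriptions are identical.
    """
    if len(transcripts) < 2:
        raise ValueError("need at least two transcriptions to compare")
    token_lists = [t.split() for t in transcripts]
    ratios = [
        SequenceMatcher(None, a, b, autojunk=False).ratio()
        for a, b in combinations(token_lists, 2)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical transcriptions of one segment by three annotators.
annotations = [
    "मैं कल बाजार गया था",
    "मैं कल market गया था",
    "मैं कल बाजार गया था",
]
print(f"Agreement: {pairwise_agreement(annotations):.2f}")
```

Segments with low agreement are candidates for manual review, or for exclusion when a stricter evaluation subset is wanted.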

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Freshness should be verified; the last update date is 2026-06-05.

Provenance

Source: ARTPARK-IISc, drawn from the Vaani project.
Collection Method: Curated from the Vaani project with independent human transcriptions.
Time Range: Not specified.
Freshness: Last updated 2026-06-05 11:50:50.
Geography: 104 districts across 16 Indian states.

Quality Score

Overall: C40
Description: 51
Source: 39
Reputation: 35
Access: 22

Community

Likes: 1
Views: 0

Dataset Info

Author: ARTPARK-IISc
Created: Jun 5, 2026
Updated: Jun 5, 2026
Last synced: Jun 14, 2026
