Contextual ASR Benchmark: Synthetic Voice Bot Data for 10 Indic Languages | DataSalon

Category: Speech & Audio

Name: Contextual ASR Benchmark: Synthetic Voice Bot Data for 10 Indic Languages
Creator: sarvamai
Published: 2026-02-02T19:58:50
Keywords: Size Categories: 1K<n<10K, Library: polars, Modality: audio, OPTIMIZED-PARQUET, Modality: text, Library: mlcroissant, Library: datasets, Library: pandas, Parquet, Region: us

Available on 1 platform

Description

Sarvam AI developed this synthetic benchmark in 2026 to evaluate context-aware Automatic Speech Recognition (ASR) within voice bot environments. The collection includes between 1,000 and 10,000 records covering the top 10 Indian languages, focusing on how conversation history and agent prompts influence transcription accuracy.

Use Cases

Evaluating ASR accuracy using conversation history to improve transcription of user responses
Fine-tuning models using agent prompts to provide linguistic context for speech-to-text tasks
Benchmarking multilingual ASR performance across the 10 supported Indian languages
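
The listing does not say which metric the benchmark uses to score transcriptions. As a minimal sketch, assuming word error rate (WER) is the metric of interest, the comparison between a context-free and a context-aware transcription could look like the following. The function names `word_error_rate` and `compare_context` are hypothetical, and the code assumes nothing about the dataset's actual column names or about any particular ASR system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace. Real evaluations usually
    also normalize punctuation and script-specific characters first.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


def compare_context(reference: str, without_context: str, with_context: str) -> dict:
    """Score two hypotheses for the same utterance against one reference.

    `without_context` and `with_context` are transcriptions produced by
    whatever ASR system is under test (hypothetical inputs here).
    """
    wer_without = word_error_rate(reference, without_context)
    wer_with = word_error_rate(reference, with_context)
    return {
        "wer_without_context": wer_without,
        "wer_with_context": wer_with,
        "absolute_improvement": wer_without - wer_with,
    }
```

Averaging `absolute_improvement` separately for each language would give a per-language view of how much conversation history and agent prompts help, which matches the multilingual benchmarking use case listed above.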

Strengths

Covers 10 distinct Indian languages
Provides conversation history and agent prompts for context-aware ASR testing
Stored in optimized Parquet format for high-performance data loading

Limitations

Synthetic generation may not capture the full range of real-world acoustic noise
Small scale of 1,000 to 10,000 records limits its use for large-scale pre-training
Scenario-specific focus on voice bots may not generalize to other speech domains

Provenance

Source: sarvamai
Collection Method: synthetic
Freshness: Last updated February 2026.
Geography: India

Quality Score

D37

Community

101 downloads

3 likes

0 views

Dataset Info

Author: sarvamai
Created: Feb 2, 2026
Updated: Feb 2, 2026
Last synced: May 4, 2026
