Description

This dataset is derived from Project Vaani, a large-scale multilingual speech initiative by IISc Bangalore and ARTPARK that captures India's linguistic diversity across all districts. It contains noise event timestamps and is still being built: what is currently available is a subset of a planned corpus of approximately 167 hours of training data. The dataset page was last updated on 2026-06-05.

Use Cases

Train noise detection models on the annotated noise event timestamps.
Develop speech enhancement algorithms using noise segments isolated via the timestamps.
Benchmark multilingual speech recognition systems on data that spans India's linguistic diversity.
Study acoustic environments across India's districts, drawing on the corpus's geographic scope.
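
The first two use cases both start from the same step: turning event timestamps into something a model can consume. The sketch below shows one possible way to do that, assuming each noise event is available as a (start, end) pair in integer milliseconds and the audio as a flat sequence of samples. The function names, the 20 ms frame length, and the millisecond representation are illustrative assumptions, not part of the dataset's documented schema.

```python
def events_to_frame_labels(events_ms, duration_ms, frame_ms=20):
    """Return one 0/1 label per fixed-length frame.

    A frame is labelled 1 if it overlaps any noise event.
    events_ms: iterable of (start_ms, end_ms) integer pairs (assumed layout).
    """
    n_frames = -(-duration_ms // frame_ms)  # ceiling division
    labels = [0] * n_frames
    for start_ms, end_ms in events_ms:
        if end_ms <= start_ms:
            continue  # skip empty or inverted events
        first = max(0, start_ms // frame_ms)
        last = min(n_frames, -(-end_ms // frame_ms))
        for i in range(first, last):
            labels[i] = 1
    return labels


def slice_noise_segments(samples, sample_rate, events_ms):
    """Cut the samples covered by each noise event out of a waveform."""
    segments = []
    for start_ms, end_ms in events_ms:
        lo = max(0, start_ms * sample_rate // 1000)
        hi = min(len(samples), end_ms * sample_rate // 1000)
        if hi > lo:
            segments.append(samples[lo:hi])
    return segments
```

Frame labels of this kind could serve as targets for a noise detector, while the sliced segments could be used as noise-only material when building speech enhancement training pairs. The real field names and time units should be confirmed against the downloaded files first.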

Strengths

Derived from Project Vaani, a large-scale initiative by IISc Bangalore and ARTPARK.
Planned corpus includes approximately 167 hours of training data.
Covers India's linguistic diversity across all districts.

Limitations

Dataset is incomplete and actively being built, with only a subset currently available.
Column-level documentation is absent; field semantics must be inferred after download.
Row count, file formats, and license information are unknown.
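
Because column-level documentation is absent, a first pass over the downloaded records usually has to recover the schema by inspection. The helper below is a minimal sketch of that step, assuming the records have already been loaded as a list of dictionaries; how they are loaded depends on the file format, which is not stated. The function name and the sample field names are hypothetical.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted type names observed across records."""
    seen = defaultdict(set)
    for record in records:
        for name, value in record.items():
            seen[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items()}


# Hypothetical sample rows; the real field names are unknown until download.
sample = [
    {"start": 0.5, "label": "horn"},
    {"start": 1, "label": None},
]
summary = summarize_fields(sample)
```

A field that reports more than one type, or includes NoneType, is a hint that values are mixed or missing and should be normalized before training.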

Provenance

Source: Project Vaani, a collaboration between IISc Bangalore and ARTPARK.
Collection Method: Derived from a large-scale multilingual speech collection initiative.
Freshness: Last updated 2026-06-05 19:46:11; freshness should be verified.
Geography: India, covering all districts.

Dataset is actively being built; users should check the Hugging Face page for the latest updates and completeness.

Tags: Audio, 🇮🇳 India, Multilingual, Large Scale, Natural Language Processing, Noise Events, Audio Processing

Vaani Noise Event Timestamps: Multilingual Speech from India
