Name: YODAS-Granary: Pseudo-Labeled Speech Data for 23 European Languages
Creator: espnet
Published: 2025-06-09T20:25:03
Keywords: Multilingual, European Languages, Audio, Speech Translation, Speech Recognition, Pseudo Labeled, Multimodal

Description

YODAS-Granary is a curated subset of the NVIDIA Granary dataset, providing high-quality pseudo-labeled speech data. It is designed for Automatic Speech Recognition and Automatic Speech Translation tasks across 23 European languages. The dataset was shared by ESPnet and last updated on August 8, 2025.

Use Cases

Training Automatic Speech Recognition models on the pseudo-labeled speech data.
Developing Automatic Speech Translation systems from the multilingual audio content.
Benchmarking model performance across 23 European languages using the curated subset.
Fine-tuning pre-trained speech models on high-quality, domain-specific pseudo-labels.
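
To make the benchmarking use case concrete, here is a minimal sketch of per-language word error rate (WER), a standard ASR metric. The function names, the (language, reference, hypothesis) triple layout, and the sample sentences are all illustrative assumptions, not part of the dataset. Because the references in this dataset are pseudo-labels, a score computed this way measures agreement with the labeling pipeline rather than accuracy against human transcripts.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Row-by-row dynamic programming for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_language(samples):
    """Aggregate WER per language from (lang, reference, hypothesis) triples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for lang, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[lang] += e
        words[lang] += n
    return {lang: errors[lang] / words[lang] for lang in words if words[lang] > 0}


# Made-up examples; real references would come from the dataset's transcripts.
samples = [
    ("de", "guten morgen zusammen", "guten morgen zusammen"),
    ("de", "wie geht es dir", "wie geht es"),
    ("fr", "bonjour tout le monde", "bonjour le monde"),
]
scores = wer_by_language(samples)
for lang, wer in sorted(scores.items()):
    print(f"{lang}: {wer:.3f}")
```

Errors and word counts are summed per language before dividing, so long utterances weigh more than short ones. In practice, text normalization (casing, punctuation) would need to be agreed on first.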

Strengths

Focuses on 23 European languages, providing multilingual coverage.
Curated for high quality from the larger NVIDIA Granary source.
Specifically designed for both Automatic Speech Recognition and Automatic Speech Translation tasks.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count, file formats, and license information are not stated, which makes it hard to assess suitability before download.
Data may reflect geographic or source bias inherent to the original Granary collection.
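
Since column-level documentation is absent, one way to infer field semantics after download is to scan a handful of records and list the value types seen for each field. The sketch below assumes the records can be read as dictionaries. The helper name `summarize_fields` and the sample field names are hypothetical and do not reflect the real schema.

```python
from collections import defaultdict


def summarize_fields(records, limit=100):
    """Map each field name to the sorted type names seen in the first `limit` records."""
    seen = defaultdict(set)
    for index, record in enumerate(records):
        if index >= limit:
            break
        for name, value in record.items():
            seen[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items()}


# Hypothetical records: these field names are illustrative, not the real schema.
sample_records = [
    {"utt_id": "a1", "text": "hello", "duration": 3.2, "lang": "en"},
    {"utt_id": "a2", "text": None, "duration": 4, "lang": "en"},
]
schema = summarize_fields(sample_records)
for name, types in schema.items():
    print(name, types)
```

Fields that show more than one type (for example a text field that is sometimes empty) are a good first place to look when working out what each column means. `records` can be any iterable of dictionaries, such as rows yielded by whichever loader matches the dataset's actual file format.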

Provenance

Source: A curated subset of the larger NVIDIA Granary dataset, shared by ESPnet.
Collection Method: Transcripts are pseudo-labels, likely produced by automated pipelines rather than human annotators.
Time Range: Not specified.
Freshness: Last updated 2025-08-08 15:48:18; check the source for later revisions.
Geography: Covers 23 European languages.
