DataSalon

Speech & Audio

Granary: 1 Million Hours of Speech for 25 European Languages

Name: Granary: 1 Million Hours of Speech for 25 European Languages
Creator: nvidia
Published: 2025-05-15T14:57:28
Keywords: Language: en, da, el, bg, cs, es, de; Task Categories: automatic speech recognition, translation

Available on 1 platform

Description

NVIDIA's Granary dataset provides approximately 1 million hours of speech data across 25 European languages. First published in May 2025 and last updated in March 2026, it consolidates multiple sources into a unified framework to support low-resource language modeling. The collection is designed for both Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) tasks.

Use Cases

Training ASR models for low-resource languages using the 'mt' (Maltese) or 'lv' (Latvian) language labels
Developing AST systems to translate spoken 'fr' (French) or 'de' (German) audio into text
Fine-tuning multilingual models across the 25 European language categories provided in the unified framework
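
The use cases above all begin by picking records by language label and task. The following is a minimal sketch of that selection step, run on in-memory stand-in records rather than the real dataset. The field names (`lang`, `task`, `duration_s`, `text`), the task labels (`asr`, `ast`), and the helper names are assumptions made for illustration; this listing does not document the actual column layout.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# Hypothetical record layout; the real Granary columns are not documented here.
SAMPLE_RECORDS: List[dict] = [
    {"lang": "mt", "task": "asr", "duration_s": 3600.0, "text": "sample a"},
    {"lang": "lv", "task": "asr", "duration_s": 1800.0, "text": "sample b"},
    {"lang": "fr", "task": "ast", "duration_s": 7200.0, "text": "sample c"},
    {"lang": "de", "task": "ast", "duration_s": 900.0, "text": "sample d"},
    {"lang": "mt", "task": "asr", "duration_s": 1800.0, "text": "sample e"},
]


def select_records(records: Iterable[dict], langs: Iterable[str], task: str) -> Iterator[dict]:
    """Yield records whose language label is in `langs` and whose task matches."""
    wanted = set(langs)
    for record in records:
        if record["lang"] in wanted and record["task"] == task:
            yield record


def hours_by_language(records: Iterable[dict]) -> Dict[str, float]:
    """Sum audio duration per language label, converted from seconds to hours."""
    totals: Counter = Counter()
    for record in records:
        totals[record["lang"]] += record["duration_s"] / 3600.0
    return dict(totals)


# Low-resource ASR subset: Maltese and Latvian.
low_resource_asr = list(select_records(SAMPLE_RECORDS, ["mt", "lv"], "asr"))
# Speech translation subset: French and German.
translation_subset = list(select_records(SAMPLE_RECORDS, ["fr", "de"], "ast"))
asr_hours = hours_by_language(low_resource_asr)
```

Checking the per-language hour totals before training is a cheap way to see how unbalanced a multilingual mix is. The same filter could be applied lazily to a streamed copy of the data once the real field names are known.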

Strengths

~1 million hours of speech data
Coverage of 25 European languages including low-resource variants
Unified framework for ASR and AST tasks

Limitations

Lack of detailed column-level metadata in the provided summary
Potential variability in audio quality across consolidated sources

Provenance

Source: NVIDIA
Collection Method: Consolidation of multiple existing datasets under a unified framework
Freshness: Last updated March 2026.
Geography: Europe

Users should verify the specific licenses of the consolidated datasets within the unified framework as individual components may have varying restrictions.

Quality Score

D39

Community

4.1K downloads

190 likes

0 views

Dataset Info

Author: nvidia
Created: May 15, 2025
Updated: Mar 12, 2026
Last synced: Jul 24, 2026
