DataSalon

Common Voice Geo Cleaned: 35 Hours of Georgian Speech for TTS | DataSalon

Speech & Audio

Name: Common Voice Geo Cleaned: 35 Hours of Georgian Speech for TTS
Creator: NMikka
Published: 2026-03-09T12:57:24
Keywords: Size Categories: 10K<n<100K, Common Voice, Text To Speech, Task Categories: text-to-speech, Library: polars, Library: dask, Task Categories: audio-to-audio, Georgian, License: CC0 1.0, Speech Synthesis, Modality: text, Library: mlcroissant, Library: datasets, Parquet, Region: US

Available on 1 platform

Description

This dataset contains 21,421 cleaned Georgian speech samples totaling 35 hours, curated by NMikka from Mozilla Common Voice 19.0 in 2026. The audio is 24 kHz mono WAV from 12 speakers, filtered for speech synthesis and recognition tasks.
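
The headline figures imply some useful averages. The sketch below derives them from the numbers quoted in the description. It assumes the 35-hour total is rounded, so the results are approximate, and the function name `average_stats` is illustrative only.

```python
# Figures quoted in the dataset description (the hour total is likely rounded).
TOTAL_HOURS = 35
NUM_SAMPLES = 21_421
NUM_SPEAKERS = 12


def average_stats(total_hours, num_samples, num_speakers):
    """Derive per-clip and per-speaker averages from the headline figures."""
    total_seconds = total_hours * 3600
    return {
        "seconds_per_clip": total_seconds / num_samples,
        "clips_per_speaker": num_samples / num_speakers,
        "hours_per_speaker": total_hours / num_speakers,
    }


stats = average_stats(TOTAL_HOURS, NUM_SAMPLES, NUM_SPEAKERS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

On these figures the average clip is a little under 6 seconds, and each speaker contributes roughly 3 hours on average. The real per-speaker distribution may be far less even than this mean suggests.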

Use Cases

Fine-tuning text-to-speech models using the 24 kHz audio samples
Training Georgian speech-to-text systems using the provided transcriptions
Developing acoustic models for low-resource language synthesis

Strengths

35 hours of cleaned audio
24 kHz mono WAV format
CC-0 public domain license
12 distinct speakers

Limitations

Limited speaker diversity, with only 12 unique voices
Possible gender or age imbalance within the small speaker set

Provenance

Source: Mozilla Common Voice 19.0
Collection Method: Filtered and cleaned subset of crowdsourced audio
Freshness: Based on Common Voice 19.0; dataset last updated March 2026.
Geography: Georgia

The dataset is released under a CC-0 license and is provided in Parquet format via Hugging Face.
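
Since the data is distributed as Parquet on Hugging Face, a quick way to inspect it is to stream a few rows with the `datasets` library. This is a minimal sketch, not the creator's documented loading procedure. The repository id is not given on this page, so `repo_id` must be supplied by the caller. The `"train"` split name and the helpers `summarize_row` and `peek` are assumptions made for illustration. Decoding audio columns may also require an audio backend to be installed.

```python
from itertools import islice


def summarize_row(row):
    """Map each column name to the type name of its value."""
    return {key: type(value).__name__ for key, value in row.items()}


def peek(repo_id, split="train", n=3):
    """Stream the first n rows without downloading the full dataset.

    repo_id is the Hugging Face dataset id (not stated on this page);
    the "train" split name is an assumption.
    """
    # Imported lazily so this module loads even where `datasets` is not installed.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split=split, streaming=True)
    return [summarize_row(row) for row in islice(stream, n)]
```

Calling `peek("<owner>/<dataset-name>")` with the real repository id returns the column names and value types of the first few rows, which is enough to confirm the audio and transcription fields before committing to a full download.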

Quality Score

C41

Community

117 downloads

5 likes

0 views

Dataset Info

Author: NMikka
Created: Mar 9, 2026
Updated: Mar 11, 2026
Last synced: Apr 18, 2026
