Name: Japanese Speech Dataset With 380 Speakers And 1.2 Million Samples
Creator: tts-dataset
Published: 2026-01-02T14:49:31
Keywords: Task Categories: text-to-speech, License: other, Size Categories: 1M<n<10M, Library: webdataset, Modality: text, Library: mlcroissant, WebDataset, Library: datasets, Region: us, Task Categories: automatic-speech-recognition

Description

Filtered GOL Dataset is a Japanese text-to-speech resource containing approximately 1.2 million audio samples totaling about 1,880 hours from 380 speakers. It was filtered by tts-dataset for TTS training, with rules on text length, audio duration, and a minimum amount of audio per speaker. The audio is FLAC at 44.1 kHz, packaged as a WebDataset.
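
As a rough sanity check, the headline figures imply the averages computed below. The inputs are the approximate values quoted in this listing, so the results are only indicative; the constant names are illustrative and not part of the dataset.

```python
# Headline figures quoted in the listing (all approximate).
TOTAL_HOURS = 1_880
NUM_SAMPLES = 1_200_000
NUM_SPEAKERS = 380

# Average clip length in seconds.
mean_clip_seconds = TOTAL_HOURS * 3600 / NUM_SAMPLES

# Average amount of audio and number of clips per speaker.
mean_hours_per_speaker = TOTAL_HOURS / NUM_SPEAKERS
mean_samples_per_speaker = NUM_SAMPLES / NUM_SPEAKERS

print(f"mean clip length:         {mean_clip_seconds:.2f} s")
print(f"mean audio per speaker:   {mean_hours_per_speaker:.2f} h")
print(f"mean samples per speaker: {mean_samples_per_speaker:.0f}")
```

The mean clip length comes out near 5.6 seconds, well inside the 1-60 second filter window. The per-speaker mean comes out just under 5 hours, slightly below the 5-hour per-speaker minimum stated elsewhere in this listing, which suggests the totals are rounded or the minimum was applied at a different filtering stage; the listing does not say which.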

Use Cases

Train a multi-speaker TTS model using the 380 distinct speaker identities and their associated FLAC audio.
Fine-tune an ASR model on Japanese speech using the 1.2 million samples of filtered text and audio pairs.
Analyze speaker characteristics or build a voice cloning system leveraging the dataset's requirement of at least 5 hours of audio per speaker.
Preprocess and filter raw speech data by applying similar rules on text length (3+ characters) and audio duration (1-60 seconds).
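
The filtering rules named above (text length, clip duration, per-speaker minimum) can be sketched as follows. This is a minimal illustration, not the curator's actual pipeline: the `Clip` record, the function names, the boundary handling, and the order in which the rules are applied are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

# Thresholds taken from the listing; how the boundaries are treated is assumed.
MIN_TEXT_CHARS = 3        # "3+ characters"
MIN_DURATION_S = 1.0      # lower bound, assumed inclusive
MAX_DURATION_S = 60.0     # "under 60 seconds", so treated as exclusive here
MIN_SPEAKER_HOURS = 5.0   # minimum audio per speaker


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one utterance."""
    speaker_id: str
    text: str
    duration_s: float


def passes_clip_rules(clip: Clip) -> bool:
    """Apply the per-clip text length and duration rules."""
    long_enough_text = len(clip.text.strip()) >= MIN_TEXT_CHARS
    duration_ok = MIN_DURATION_S <= clip.duration_s < MAX_DURATION_S
    return long_enough_text and duration_ok


def filter_clips(clips, min_speaker_hours=MIN_SPEAKER_HOURS):
    """Keep clips that pass the per-clip rules, then drop speakers whose
    remaining audio falls below the per-speaker minimum (assumed order)."""
    kept = [c for c in clips if passes_clip_rules(c)]

    seconds_by_speaker = defaultdict(float)
    for c in kept:
        seconds_by_speaker[c.speaker_id] += c.duration_s

    min_seconds = min_speaker_hours * 3600
    return [c for c in kept if seconds_by_speaker[c.speaker_id] >= min_seconds]
```

Applying the speaker minimum after the per-clip rules means a speaker can fall below the threshold once short or overlong clips are removed; a pipeline that checks the minimum first would keep a different set of speakers.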

Strengths

Large scale with approximately 1.2 million samples and 1,880 total hours of audio.
Includes 380 speakers, each contributing a minimum of 5 hours of data, supporting diverse voice modeling.
Applied rigorous filtering to exclude non-linguistic text, emoticons, and repetitive characters, improving data quality for TTS.
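
The listing does not give the exact text-content rules, so the following is only a sketch of what such checks might look like. The character ranges, the `MAX_RUN` threshold, and the function name are hypothetical; emoji pictographs stand in for the emoticon rule, and text-based emoticons such as kaomoji would need additional patterns.

```python
import re

# Hiragana, katakana, and the main CJK ideograph block.
JAPANESE_CHAR = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

# Hypothetical threshold: the longest run of one repeated character allowed.
MAX_RUN = 4
REPEATED_RUN = re.compile(r"(.)\1{%d,}" % MAX_RUN)

# Common emoji and symbol ranges (an incomplete stand-in for "emoticons").
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def is_clean_text(text: str) -> bool:
    """Return True if the text looks like ordinary Japanese for TTS training."""
    has_japanese = JAPANESE_CHAR.search(text) is not None
    has_long_repeat = REPEATED_RUN.search(text) is not None
    has_emoji = EMOJI.search(text) is not None
    return has_japanese and not has_long_repeat and not has_emoji
```

For example, `is_clean_text("今日はいい天気ですね")` passes, while a string of punctuation only, a long run of one repeated kana, or a sentence containing an emoji is rejected.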

Limitations

Audio is monaural FLAC at 44.1 kHz, so applications that need stereo or a different sample rate will have to convert it.
The dataset is exclusively in Japanese (language: ja), limiting its utility for multilingual TTS research.
Filtering conditions, such as capping audio at under 60 seconds, exclude longer-form speech samples.

Provenance

Source: Filtered version of the midralab/gol-dataset from Hugging Face, curated by tts-dataset.
Collection Method: Filtered from an original dataset using text length, audio duration, speaker minimums, and text content rules.
Freshness: Last updated on 2026-01-15.
Geography: Region is tagged as 'us', while the primary language is Japanese (ja). The region tag does not necessarily describe where the speech was recorded, so speaker geography should be treated as unspecified.

The dataset is approximately 280 GB and is packaged in the WebDataset (.tar) format, which requires a WebDataset-aware loader or a tar reader. The license is tagged as 'other'; the specific terms are not stated here.
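
Because WebDataset shards are ordinary tar archives in which files sharing a basename form one sample, they can be inspected with the Python standard library alone. The sketch below groups tar members by key; the member extensions used in the usage comment (`flac`, `txt`) and the shard filename are assumptions, since the listing does not describe the shard layout. For training, the `webdataset` or `datasets` libraries named in the keywords are the more usual route.

```python
import tarfile


def split_key(name: str):
    """Split 'dir/0001.flac' into ('dir/0001', 'flac') at the first dot
    of the basename, following the WebDataset naming convention."""
    dirname, _, basename = name.rpartition("/")
    stem, _, ext = basename.partition(".")
    key = f"{dirname}/{stem}" if dirname else stem
    return key, ext


def iter_samples(fileobj):
    """Yield one dict per sample from a tar shard, grouping consecutive
    members that share a key. Values are the raw bytes of each member."""
    current = None
    # "r|*" reads the archive as a stream, so members are read in order.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, ext = split_key(member.name)
            if current is not None and current["__key__"] != key:
                yield current
                current = None
            if current is None:
                current = {"__key__": key}
            current[ext] = tar.extractfile(member).read()
    if current is not None:
        yield current


# Usage sketch (hypothetical shard name and member extensions):
# with open("shard-000000.tar", "rb") as f:
#     for sample in iter_samples(f):
#         audio_bytes = sample.get("flac")
#         transcript = sample.get("txt", b"").decode("utf-8")
```

Grouping relies on the members of one sample being adjacent in the archive, which is how WebDataset shards are normally written.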
