Name: MLS Eng Tokens: Pre-tokenized Audio Codec Tokens for TTS Training
Creator: somu9
Published: 2026-05-11T15:32:25
Keywords: Text To Speech, Machine Learning, Tokens, Speech Synthesis, Mls Corpus, Text, Multilingual, Audio, Natural Language Processing, Audio Tokens

Description

somu9's mls_eng_tokens dataset provides pre-extracted audio codec tokens from the Multilingual LibriSpeech English corpus, tokenized using MOSS-Audio-Tokenizer. The dataset includes train, dev, and test splits and was last updated on 2026-05-17. Audio is sampled at 48,000 Hz and encoded at a token frame rate of 12.5 Hz.
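The stated rates imply a fixed relationship between waveform samples, token frames, and total tokens. The sketch below works through that arithmetic using the 48,000 Hz sample rate, the 12.5 Hz frame rate, and the 16 RVQ codebooks listed for this dataset. The function names are illustrative, and rounding frames up is an assumption; the actual tokenizer may pad or truncate differently.

```python
import math

SAMPLE_RATE_HZ = 48_000  # from the dataset description
FRAME_RATE_HZ = 12.5     # from the dataset description
NUM_CODEBOOKS = 16       # RVQ codebooks listed for this dataset


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_hz=FRAME_RATE_HZ):
    """Waveform samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_hz


def frame_count(duration_s, frame_rate_hz=FRAME_RATE_HZ):
    """Approximate frames for a clip (assumes rounding up; unverified)."""
    return math.ceil(duration_s * frame_rate_hz)


def token_count(duration_s, num_codebooks=NUM_CODEBOOKS):
    """Total tokens across all codebooks for a clip of the given length."""
    return frame_count(duration_s) * num_codebooks


print(samples_per_frame())  # 3840.0 samples per frame
print(frame_count(10))      # 125 frames in a 10-second clip
print(token_count(10))      # 2000 tokens in a 10-second clip
```

At these rates, each token frame covers 80 ms of audio, so sequence lengths stay short relative to codecs that run at higher frame rates.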

Use Cases

Train text-to-speech models directly on pre-extracted audio codec tokens, skipping the tokenization step.
Benchmark speech synthesis systems against the standardized MLS English corpus.
Experiment with residual vector quantization (RVQ) codebook structures using the 16-codebook token layout.
Fine-tune audio generation models on tokens with a fixed 12.5 Hz frame rate.
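As one illustration of the RVQ experiments mentioned above, the sketch below keeps only the first k codebooks and flattens the result into a single token stream. It assumes tokens are arranged as a nested list indexed `[codebook][frame]` and that every codebook has the same size; neither the layout nor the codebook size is documented for this dataset, so both are hypothetical, and the function names are illustrative.

```python
def truncate_codebooks(codes, k):
    """Keep the first k codebooks (the coarsest RVQ levels)."""
    if not 1 <= k <= len(codes):
        raise ValueError(f"k must be between 1 and {len(codes)}")
    return codes[:k]


def flatten_frame_major(codes, codebook_size):
    """Interleave codebooks frame by frame into one sequence.

    Each codebook q is shifted by q * codebook_size so that IDs from
    different codebooks do not collide in the shared vocabulary.
    """
    num_frames = len(codes[0])
    if any(len(row) != num_frames for row in codes):
        raise ValueError("all codebooks must have the same number of frames")
    flat = []
    for t in range(num_frames):
        for q, row in enumerate(codes):
            flat.append(q * codebook_size + row[t])
    return flat


# Toy example: 3 codebooks, 3 frames, hypothetical codebook size of 10.
toy_codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
coarse = truncate_codebooks(toy_codes, k=2)
print(flatten_frame_major(coarse, codebook_size=10))  # [1, 14, 2, 15, 3, 16]
```

Frame-major flattening is only one option; delayed or parallel codebook patterns are common alternatives and may suit a given model better.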

Strengths

Tokens are derived from the established Multilingual LibriSpeech (MLS) English dataset.
The codec configuration is stated explicitly: 48,000 Hz sample rate, 12.5 Hz frame rate, and 16 RVQ codebooks.
Dataset includes standard machine learning splits: train, dev, and test.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count, file formats, and license are not documented, which makes it hard to assess suitability before download.
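Because the columns are undocumented, a first step after download is to summarize the fields of a single record. The sketch below is a minimal helper for that, using a made-up record; the field names `text`, `codes`, and `speaker_id` are hypothetical and are not taken from the dataset.

```python
def describe_value(value):
    """Summarize a field as a type name, or a nested-list shape and leaf type.

    The shape is read from the first element at each level, so it is only
    meaningful for rectangular (non-ragged) nested lists.
    """
    if not isinstance(value, (list, tuple)):
        return type(value).__name__
    shape = []
    current = value
    while isinstance(current, (list, tuple)):
        shape.append(len(current))
        if not current:
            break
        current = current[0]
    leaf = "empty" if isinstance(current, (list, tuple)) else type(current).__name__
    return f"list shape={tuple(shape)} leaf={leaf}"


def describe_record(record):
    """Map every field name in a record to a short description."""
    return {name: describe_value(value) for name, value in record.items()}


# Hypothetical record for illustration only. With the Hugging Face `datasets`
# library installed, a real one could likely be fetched with something like
#   next(iter(load_dataset("somu9/mls_eng_tokens", split="train", streaming=True)))
# where the repository ID is inferred from the creator and dataset names and
# has not been verified.
sample = {"text": "hello", "codes": [[1, 2, 3], [4, 5, 6]], "speaker_id": 7}
for name, summary in describe_record(sample).items():
    print(f"{name}: {summary}")
```

Checking whether a token field's shape includes a dimension of 16 is a quick way to confirm which axis holds the codebooks.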

Provenance

Source: parler-tts/mls_eng (Multilingual LibriSpeech English dataset)
Collection Method: Pre-extracted and tokenized using MOSS-Audio-Tokenizer.
Time Range: Not specified
Freshness: Last updated 2026-05-17 05:19:43; verify against the source before relying on it.
Geography: Not specified

License is unknown; users must verify permissions before use.
