Name: SonoroNova-ES: Large-Scale Synthetic English-to-Spanish Speech Translation
Creator: SonoroNova-ES
Published: 2026-05-03T05:50:25
Keywords: English Spanish, Audio, Large Scale, Natural Language Processing, Speech Translation, Synthetic Speech, Synthetic, Audio Synthesis

Description

SonoroNova-ES is a large-scale synthetic English-to-Spanish speech-to-speech translation dataset containing 329,764 utterances. It was constructed via cascade pipelines combining text-to-text translation models with neural text-to-speech engines, using source audio derived from the HiFiTTS-2 English audiobook corpus. The dataset features 1,315 unique speakers and provides a total of 961 hours of audio.
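The three headline figures above imply some rough per-utterance and per-speaker averages. The short calculation below is only arithmetic on the stated totals; it assumes the 961 hours is spread over all 329,764 utterances, and the card does not say whether that total counts English source audio, Spanish target audio, or both.

```python
# Figures quoted in the dataset description.
TOTAL_UTTERANCES = 329_764
UNIQUE_SPEAKERS = 1_315
TOTAL_HOURS = 961

# Derived averages (assumption: the hour total covers every utterance).
mean_utterance_seconds = TOTAL_HOURS * 3600 / TOTAL_UTTERANCES
utterances_per_speaker = TOTAL_UTTERANCES / UNIQUE_SPEAKERS
minutes_per_speaker = TOTAL_HOURS * 60 / UNIQUE_SPEAKERS

print(f"mean utterance length:  {mean_utterance_seconds:.1f} s")
print(f"utterances per speaker: {utterances_per_speaker:.0f}")
print(f"audio per speaker:      {minutes_per_speaker:.0f} min")
```

These are averages only; the actual distribution across speakers and utterance lengths has to be measured on the downloaded data.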

Use Cases

Train English-to-Spanish speech-to-speech translation models on the synthetic paired audio.
Benchmark English-to-Spanish translation quality using the 961 hours of paired audio.
Develop multi-speaker text-to-speech systems leveraging the 1,315 unique speaker voices.
Study the quality and characteristics of synthetic speech data generated by cascading text-to-text translation (TTT) and text-to-speech (TTS) models.

Strengths

Large scale with 329,764 utterances.
Includes audio from 1,315 unique speakers.
Provides 961 hours of audio in total.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Data may reflect source bias inherent to the HiFiTTS-2 audiobook corpus.
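Because column-level documentation is absent, a first pass over the downloaded records can at least show which fields exist, what types they hold, and how often they are missing. The helper below is a minimal sketch that assumes the records have already been loaded as a list of dictionaries; `summarize_fields` and the sample field names (`utt_id`, `speaker`, `duration_s`) are hypothetical and not taken from the dataset.

```python
from collections import defaultdict


def summarize_fields(records):
    """Summarize observed types, missing counts, and one example per field."""
    summary = defaultdict(lambda: {"types": set(), "present": 0, "example": None})
    total = 0
    for record in records:
        total += 1
        for name, value in record.items():
            info = summary[name]
            info["types"].add(type(value).__name__)
            info["present"] += 1
            # Keep the first non-None value as a representative example.
            if info["example"] is None and value is not None:
                info["example"] = value
    return {
        name: {
            "types": sorted(info["types"]),
            "missing": total - info["present"],  # records lacking the key entirely
            "example": info["example"],
        }
        for name, info in summary.items()
    }


# Hypothetical records; the real field names are not documented.
sample = [
    {"utt_id": "a1", "speaker": 17, "duration_s": 9.8},
    {"utt_id": "a2", "speaker": 42, "duration_s": None},
    {"utt_id": "a3", "speaker": 17},
]
report = summarize_fields(sample)
for name, info in report.items():
    print(name, info)
```

A report like this does not replace documentation, but it gives a starting point for inferring field semantics and spotting fields with mixed types or gaps.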

Provenance

Source: HiFiTTS-2 English audiobook corpus.
Collection Method: Constructed via cascade pipelines combining open-weight text-to-text translation models with neural text-to-speech engines.
Freshness: Last updated 2026-05-07 00:15:15; confirm against the current release before use.
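The cascade described under Collection Method can be pictured as two stages chained together: translate the English text, then synthesize Spanish speech from the translation. The sketch below shows only that shape. It assumes an English transcript is already available for each source utterance, and `run_cascade`, `toy_translate`, and `toy_synthesize` are hypothetical stand-ins, not the models or code used to build the dataset.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TranslatedUtterance:
    source_text: str
    target_text: str
    target_audio: bytes


def run_cascade(
    source_text: str,
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> TranslatedUtterance:
    """Chain a text-to-text translation stage with a text-to-speech stage."""
    target_text = translate(source_text)
    target_audio = synthesize(target_text)
    return TranslatedUtterance(source_text, target_text, target_audio)


# Toy stages so the sketch runs end to end (hypothetical, for illustration only).
def toy_translate(text: str) -> str:
    lookup = {"hello world": "hola mundo"}
    return lookup.get(text.lower(), text)


def toy_synthesize(text: str) -> bytes:
    # A real TTS engine would return waveform samples; here we just encode the text.
    return text.encode("utf-8")


result = run_cascade("Hello world", toy_translate, toy_synthesize)
print(result.target_text)
```

Passing the stages in as callables keeps the sketch independent of any particular translation model or TTS engine, neither of which the card names.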

License is unknown; terms of use must be verified before application.
