Description
VoxBox is a curated collection of bilingual speech corpora annotated with clean transcriptions and metadata. The dataset was created by SparkAudio and was last updated on April 15, 2025. It includes audio files and JSONL metadata files organized by sub-corpus, such as aishell-3, casia, commonvoice_cn, and wenetspeech4tts.
Use Cases
Train automatic speech recognition models on the audio paired with its clean transcriptions.
Develop text-to-speech synthesis systems using the bilingual audio corpora.
Analyze speech patterns using the included metadata, such as age, gender, and emotion.
Build speech processing pipelines that handle both languages covered by the dataset.
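As a rough illustration of the metadata analysis use case, the sketch below tallies records by one metadata attribute. It assumes each metadata record has already been parsed into a Python dict and that attributes are stored as flat scalar values under keys such as "gender" or "emotion". Those key names and the sample records are hypothetical, since the dataset's actual field names are not documented here.

```python
from collections import Counter


def count_by_field(records, field, missing="<missing>"):
    """Count how many records carry each value of one metadata field.

    Records that lack the field are counted under `missing`.
    Assumes field values are hashable scalars (str, int, None, ...).
    """
    counts = Counter()
    for record in records:
        counts[record.get(field, missing)] += 1
    return counts


# Made-up records for illustration only; real key names may differ.
sample_records = [
    {"text": "hello", "gender": "female", "emotion": "neutral"},
    {"text": "ni hao", "gender": "male"},
    {"text": "thanks", "gender": "female", "emotion": "happy"},
]

print(count_by_field(sample_records, "gender"))
print(count_by_field(sample_records, "emotion"))
```

Counting missing values explicitly is a deliberate choice: it shows how complete each attribute is before any analysis relies on it.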
Strengths
Curated collection of multiple established speech corpora, including aishell-3 and wenetspeech4tts.
Includes metadata attributes such as age, gender, and emotion for each speech sample.
Provides clean transcriptions for the audio data.
Limitations
Row count is not reported, which makes it harder to judge whether the dataset is large enough for a given task.
Column-level documentation is absent; field semantics must be inferred after download.
The published description is brief; actual data quality must be checked by manual inspection after download.
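Because neither the row count nor the field semantics are documented, a first pass over a downloaded metadata file can recover both. The following is a minimal sketch, assuming the JSONL files contain one JSON object per line. The function name and the example path are hypothetical and not part of the dataset.

```python
import json
from collections import Counter


def summarize_jsonl(path):
    """Return (row_count, field_counts) for a single JSONL file.

    field_counts maps each top-level key to the number of rows
    in which it appears, which hints at optional fields.
    """
    rows = 0
    field_counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            rows += 1
            if isinstance(record, dict):
                field_counts.update(record.keys())
    return rows, field_counts


# Example call (hypothetical path; adjust to the downloaded layout):
# rows, fields = summarize_jsonl("metadata/aishell-3.jsonl")
# print(rows, dict(fields))
```

A key that appears in fewer rows than the total is likely optional, which is worth knowing before filtering on attributes such as age or emotion.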
Provenance
Source
SparkAudio
Collection Method
Curated collection from multiple bilingual speech corpora.
Freshness
Last updated 2025-04-15 07:43:25; freshness should be verified.
License is unknown and should be verified before use.