Vox Classica is a Latin speech corpus of approximately 73 hours of audio, segmented into short clips by sentence. It is a large-scale, ML-ready dataset of human-read Classical Latin designed to address the absence of a publicly available corpus large enough for model training. The dataset was curated by Kaiyuan Zhao and published by Ken-Z.
Use Cases
- Train speech recognition models on the corpus of human-read Latin sentences.
- Evaluate speech synthesis models using the segmented audio clips.
- Develop language learning applications leveraging the Classical Latin audio.
- Research phonetic and prosodic features of Classical Latin from recorded speech.
Strengths
- Approximately 73 hours of audio provides a substantial volume for model training.
- Sentence-level segmentation produces short, self-contained clips, consistent with the dataset's ML-ready design.
- The dataset was explicitly designed to fill a gap in publicly available human-read Latin corpora.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- The row count (number of clips) is not reported, which makes it harder to assess suitability before download.
- The last update timestamp is 2026-05-14; verify that the data is still current before relying on it.
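Because the column semantics and row count are undocumented, a first step after download is to tally them directly. The sketch below is one minimal way to do that over any iterable of dictionary-like rows. The function name `summarize_records` and the sample field names (`clip_id`, `text`, `duration_s`) are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def summarize_records(records: Iterable[Mapping[str, Any]]) -> dict:
    """Count rows and tally which Python types appear under each field."""
    row_count = 0
    field_types: dict[str, Counter] = {}
    for record in records:
        row_count += 1
        for name, value in record.items():
            # One counter per field, keyed by the value's type name.
            field_types.setdefault(name, Counter())[type(value).__name__] += 1
    return {
        "rows": row_count,
        "fields": {name: dict(counts) for name, counts in field_types.items()},
    }


# Hypothetical rows for illustration only; the real field names are undocumented.
sample = [
    {"clip_id": "a001", "text": "Gallia est omnis divisa in partes tres.", "duration_s": 3.2},
    {"clip_id": "a002", "text": "Arma virumque cano.", "duration_s": 1.9},
]

summary = summarize_records(sample)
print(summary["rows"], sorted(summary["fields"]))
```

Running the same function over the downloaded rows would give the missing row count and a first hint at what each field holds. The meaning of each field still has to be confirmed by inspecting values.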
Provenance
- Source: Ken-Z (author) via Hugging Face.
- Collection Method: Human-read audio, curated and aligned by Kaiyuan Zhao.
- Freshness: Last updated 2026-05-14 17:54:44.
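A freshness check can be as simple as comparing the recorded timestamp with a reference time. The sketch below assumes the timestamp is in UTC, which the listing does not state. The helper name `age_in_days` and the reference date are illustrative.

```python
from datetime import datetime, timezone

# Timestamp as recorded in the provenance listing.
LAST_UPDATED = "2026-05-14 17:54:44"


def age_in_days(last_updated: str, now: datetime) -> float:
    """Return the time elapsed between `last_updated` and `now`, in days."""
    # Assumption: the timestamp is UTC; the listing gives no time zone.
    updated = datetime.strptime(last_updated, "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    return (now - updated).total_seconds() / 86400


# Illustrative reference time; in practice pass datetime.now(timezone.utc).
reference = datetime(2026, 5, 15, 17, 54, 44, tzinfo=timezone.utc)
print(age_in_days(LAST_UPDATED, reference))
```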