Speech recordings of Armenian dialect varieties from Artsakh (Stepanakert, Getashen, Hadrut) paired with aligned transcriptions, split into train (5,122 examples), validation (140), and test (175) subsets. The dataset was created by DALiH-ANR and last updated on February 17, 2026.
Use Cases
- Train speech-to-text models for Armenian dialects on the audio-transcription pairs.
- Benchmark automatic speech recognition systems on the provided test subset.
- Analyze phonetic or lexical variation across the Stepanakert, Getashen, and Hadrut dialect subsets.
- Develop language models or tools for the Artsakh Armenian language community.
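For the benchmarking use case, word error rate (WER) is a common metric for comparing a model's output against reference transcriptions. The following is a minimal sketch, not a scoring script shipped with the dataset: the function names `edit_distance`, `word_error_rate`, and `corpus_wer` are illustrative, and whitespace tokenization is an assumption that may need adjusting for Armenian punctuation and orthography.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for one utterance: edit distance divided by reference length."""
    ref = reference.split()  # assumption: whitespace tokenization
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        total_edits += edit_distance(ref, hypothesis.split())
        total_words += len(ref)
    if total_words == 0:
        raise ValueError("no reference words found")
    return total_edits / total_words
```

Calling `corpus_wer` on (reference, hypothesis) pairs from the test subset would give a single score; pooling edits across utterances avoids over-weighting very short recordings.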
Strengths
- Contains 5,122 training examples, providing a substantial base for model training.
- Includes 140 validation and 175 test examples for evaluation.
- Covers three distinct Artsakh dialect varieties (Stepanakert, Getashen, Hadrut).
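The split sizes above imply how much data is held out for evaluation. A short sketch of the arithmetic, using only the three counts listed here (the variable names are illustrative):

```python
# Split sizes as listed in this summary.
split_sizes = {"train": 5122, "validation": 140, "test": 175}

# Total across the listed splits.
total = sum(split_sizes.values())

# Share of each split, in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}
```

The listed splits sum to 5,437 examples, with roughly 94.2% in train, 2.6% in validation, and 3.2% in test, so evaluation scores rest on a comparatively small held-out set.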
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
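- The metadata does not state a total row count; the listed splits sum to 5,437 examples, but whether that covers the full dataset is unconfirmed, which may limit suitability assessment.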
- Description metadata is limited; actual data quality requires manual inspection after download.
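Because the columns are undocumented, a first inspection pass after download might simply tally which fields appear, what Python types they hold, and how often they are empty. The sketch below assumes the rows have already been loaded as a list of dictionaries; `summarize_fields` is a hypothetical helper, and the field names in the usage note are placeholders, not the dataset's actual columns.

```python
from typing import Any, Dict, List


def summarize_fields(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Report observed value types and missing-value counts for every field."""
    # Collect every field name that appears in at least one record.
    all_fields = set()
    for record in records:
        all_fields.update(record)

    summary = {field: {"types": set(), "missing": 0} for field in all_fields}
    for record in records:
        for field in all_fields:
            value = record.get(field)
            # Treat absent keys, None, and blank strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                summary[field]["missing"] += 1
            else:
                summary[field]["types"].add(type(value).__name__)
    return summary
```

For example, `summarize_fields([{"audio_path": "a.wav", "text": "x"}, {"audio_path": "b.wav", "text": ""}])` reports one missing value for the placeholder field `text`, which is the kind of gap worth checking before training.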
Provenance
- Source: DALiH-ANR
- Freshness: Last updated 2026-02-17 09:28:18; freshness should be verified.
- Geography: Artsakh region (Stepanakert, Getashen, Hadrut)