Description

This is Kazakhstan's largest open Kazakh speech corpus: an extended version of the KSC2 dataset from the Institute of Smart Systems and Artificial Intelligence (ISSAI) at Nazarbayev University. It contains approximately 1,110 hours of audio across 595,690 recordings and adds punctuation and word-level timestamps produced with the Montreal Forced Aligner (MFA). The dataset is published and maintained by Jeti Labs.

Use Cases

Train automatic speech recognition (ASR) models on the 1,110 hours of Kazakh audio.
Develop punctuation restoration models for transcribed speech using the added punctuation annotations.
Build text-to-audio (forced) alignment systems using the word-level timestamps from MFA alignment.
Conduct linguistic research on Kazakh phonetics and prosody using the aligned speech corpus.
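
As an illustration of what word-level timestamps make possible, the sketch below groups aligned words into segments wherever the silence between two consecutive words is long enough. It assumes each word comes with a start and end time in seconds; the `AlignedWord` type, the `split_on_pauses` function, and the 0.5 s threshold are hypothetical and are not taken from the dataset, whose actual field names are undocumented.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    """One aligned word (hypothetical layout, not the dataset's schema)."""
    text: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording


def split_on_pauses(words, min_pause=0.5):
    """Group words into segments, cutting where the gap between the end of
    one word and the start of the next is at least min_pause seconds."""
    segments = []
    current = []
    for word in words:
        if current and word.start - current[-1].end >= min_pause:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments


# Made-up sample data with placeholder tokens.
sample = [
    AlignedWord("w1", 0.00, 0.30),
    AlignedWord("w2", 0.35, 0.60),
    AlignedWord("w3", 1.50, 1.90),
]
segments = split_on_pauses(sample)
```

The same grouping could serve prosody studies (pause distribution) or be used to cut long recordings into shorter training clips, provided the real timestamps follow a comparable layout.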

Strengths

Contains 595,690 audio recordings, providing a substantial volume of data.
Offers approximately 1,110 hours of speech, making it a large-scale resource for a low-resource language.
Enhanced with punctuation and word-level timestamps, adding valuable structure beyond raw audio.
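
To make the value of punctuated transcripts concrete, here is a minimal sketch of how a punctuated sentence might be turned into training pairs for a punctuation restoration model: each lowercased word is paired with the mark that follows it, or "O" when there is none. The function name, the label scheme, and the set of marks are assumptions for illustration; the dataset's actual transcript format should be checked after download.

```python
# Marks treated as labels (an assumed set, not taken from the dataset).
MARKS = ",.?!"


def punctuation_labels(transcript):
    """Return (word, label) pairs for a punctuated transcript.

    The label is the first mark directly after the word, or "O" if the
    word is not followed by any of the marks in MARKS.
    """
    pairs = []
    for token in transcript.split():
        word = token.rstrip(MARKS)
        label = token[len(word)] if len(word) < len(token) else "O"
        word = word.lower()
        if word:  # skip tokens that consist only of punctuation
            pairs.append((word, label))
    return pairs


# Made-up English sample; str.lower() handles Cyrillic text the same way.
pairs = punctuation_labels("Yes, it works. Really?")
```

A model trained on such pairs sees unpunctuated, lowercased words as input and predicts the label sequence.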

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
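
Because the fields are undocumented, a first step after download is usually to list which fields occur and what value types they hold. The sketch below does this for records that have already been loaded as Python dictionaries; how to load them depends on the file format, which the listing does not state, and the sample field names are invented.

```python
def summarize_fields(records):
    """Collect, for every field name, the value types seen and one example."""
    summary = {}
    for record in records:
        for name, value in record.items():
            info = summary.setdefault(name, {"types": set(), "example": value})
            info["types"].add(type(value).__name__)
    return summary


# Invented records; the real field names must be read from the data itself.
sample_records = [
    {"id": 1, "text": "first transcript"},
    {"id": 2, "text": None},
]
summary = summarize_fields(sample_records)
```

Fields that show more than one type, such as a transcript that is sometimes missing, are good candidates for the manual quality inspection mentioned above.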

Provenance

Source: Extended from the ISSAI KSC2 corpus by Nazarbayev University, published by Jeti Labs.
Collection Method: Likely derived from the KSC2 speech corpus, with automated forced alignment (MFA) and punctuation annotation added.
Freshness: Last updated 2026-04-16 05:37:14; freshness should be verified.
Geography: Kazakhstan (implied by Kazakh language focus).

The dataset size is listed as 52.9 GB, which is substantial for download and storage.
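
For planning download and storage, the listed figures can be turned into rough per-recording numbers. The sketch below assumes that 1 GB means 10^9 bytes and that the 52.9 GB covers audio and metadata together; neither is stated in the listing, so the results are only estimates.

```python
TOTAL_HOURS = 1110        # approximate, from the description
RECORDINGS = 595_690      # from the description
TOTAL_BYTES = 52.9e9      # assumes 1 GB = 10**9 bytes


def corpus_stats(hours, recordings, total_bytes):
    """Derive average duration, average size, and implied bitrate."""
    total_seconds = hours * 3600
    return {
        "avg_seconds": total_seconds / recordings,
        "avg_bytes": total_bytes / recordings,
        "kbit_per_second": total_bytes * 8 / total_seconds / 1000,
    }


stats = corpus_stats(TOTAL_HOURS, RECORDINGS, TOTAL_BYTES)
```

Under these assumptions an average recording lasts about 6.7 seconds and takes about 89 kB, which corresponds to roughly 106 kbit/s across the whole corpus.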

Tags: Audio, Multimodal, Speech Corpus, Kazakh Speech, Audio Alignment, Punctuation Restoration, Kazakh Language, Speech Recognition

Kazakh Speech MFA Punctuation: 1,110 Hours of Annotated Audio
