Name: Arabic Tashkeel Speech: 1,093 Diacritized Recordings from 10 Speakers
Creator: NahwAI
Published: 2026-04-21T17:02:43
Keywords: Arabic Speech, Audio, Audio Corpus, Diacritization, Speech Recognition

Description

An open-source collection of 1,093 Arabic speech recordings, crowd-sourced from native speakers via Nahw.ai. Each recording is resampled to 16 kHz and paired with a fully diacritized transcription. It was created by NahwAI and last updated on 2026-04-21.

Use Cases

Training automatic speech recognition (ASR) models for Arabic based on the audio recordings and transcriptions.
Developing and benchmarking text-to-speech (TTS) systems for Arabic using the diacritized sentences and corresponding speech.
Building or evaluating Arabic diacritization (tashkeel) models using the fully vocalized text data.
Studying Arabic phonetics and pronunciation variation across the 10 different native speakers.
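
For the diacritization use case, training pairs can be derived from the vocalized transcriptions alone by stripping the diacritics to form the model input and keeping the original as the target. The sketch below is a minimal illustration using only the Python standard library. The function names are hypothetical, and the character range is limited to the common harakat (U+064B to U+0652) plus the superscript alef (U+0670); whether the transcriptions in this dataset use further marks is an assumption to check against the data.

```python
import re

# Common Arabic diacritics: fathatan (U+064B) through sukun (U+0652),
# plus the superscript alef (U+0670). Other marks are NOT covered here.
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text: str) -> str:
    """Return the text with the diacritics listed above removed."""
    return TASHKEEL.sub("", text)


def make_diacritization_pair(vocalized: str) -> tuple[str, str]:
    """Build an (undiacritized input, diacritized target) pair."""
    return strip_tashkeel(vocalized), vocalized


if __name__ == "__main__":
    # Kaf, ta and ba, each followed by a fatha.
    sample = "\u0643\u064E\u062A\u064E\u0628\u064E"
    bare, target = make_diacritization_pair(sample)
    print(bare, target)
```

Keeping the target string untouched means the same pair can also serve as a reference when scoring a diacritization model's output.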

Strengths

Contains 1,093 speech recordings paired with fully diacritized text, which is uncommon because everyday Arabic is usually written without diacritics.
Includes data from 10 different native speakers, which may capture some speaker diversity.
Audio is consistently sampled at 16 kHz, a standard rate for speech processing.
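
The 16 kHz claim is straightforward to verify after download. The sketch below reads the sample rate from a file header with Python's standard `wave` module. It assumes the audio has been exported to uncompressed PCM WAV files, which this card does not confirm; the helper names and the generated demo file are illustrative only.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # Hz, as stated in the dataset description


def wav_sample_rate(path: str) -> int:
    """Read the sample rate (frames per second) from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def has_expected_rate(path: str, expected: int = EXPECTED_RATE) -> bool:
    """Return True if the file's sample rate matches the expected rate."""
    return wav_sample_rate(path) == expected


def write_silence(path: str, rate: int, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit silent WAV file (used only for the demo)."""
    n_frames = int(rate * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * n_frames)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.wav")
        write_silence(demo, EXPECTED_RATE)
        print(wav_sample_rate(demo), has_expected_rate(demo))
```

If the recordings are stored in another container or embedded in a tabular file, a dedicated audio library would be needed instead of `wave`.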

Limitations

Row count is recorded as unknown in the metadata even though the title reports 1,093 recordings; the actual count should be confirmed after download.
Column-level documentation is absent; field semantics must be inferred after download.
Data may reflect geographic or demographic bias inherent to the crowd-sourcing platform and its participants.
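
Because the row count and column semantics are undocumented, a first pass over the downloaded records can recover both. The sketch below assumes the rows have already been loaded into a list of dictionaries by whatever loader is appropriate. The field names `text` and `speaker_id` in the demo are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import Counter


def summarize_records(records, speaker_field=None):
    """Report row count, observed value types per field, and rows per speaker.

    `speaker_field` is optional because the real column name is unknown.
    """
    field_types = {}
    for row in records:
        for name, value in row.items():
            field_types.setdefault(name, set()).add(type(value).__name__)

    per_speaker = Counter()
    if speaker_field is not None:
        per_speaker.update(row.get(speaker_field) for row in records)

    return {
        "rows": len(records),
        "fields": {name: sorted(types) for name, types in field_types.items()},
        "per_speaker": dict(per_speaker),
    }


if __name__ == "__main__":
    # Hypothetical rows; real field names must be inferred from the download.
    demo = [
        {"text": "\u0643\u064E\u062A\u064E\u0628\u064E", "speaker_id": "s1"},
        {"text": "\u0645\u064E\u062F\u0651", "speaker_id": "s2"},
        {"text": "\u0643\u062A\u0628", "speaker_id": "s1"},
    ]
    print(summarize_records(demo, speaker_field="speaker_id"))
```

Comparing the reported row count with the 1,093 in the title, and the number of distinct speakers with 10, gives a quick consistency check on the card's claims.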

Provenance

Source: NahwAI via Hugging Face.
Collection Method: Crowd-sourced from native speakers via Nahw.ai.
Time Range: Not specified.
Freshness: Last updated 2026-04-21 17:25:02; freshness should be verified.
Geography: Not specified.

License: Listed as CC-BY-4.0 in the summary table but as 'unknown' in another metadata field; users should verify the license terms on the dataset page.
