Description
The Darija Speech To Text Dataset is a collection of 13,178 rows of transcribed speech audio totaling 8.23 GB, created by ayoubkirouane. It was last updated on 2024-07-18 and focuses on the Darija dialect, primarily from Algeria and Morocco, with slang from other Arabic-speaking countries.
Use Cases
Train automatic speech recognition (ASR) models on the paired audio and transcriptions.
Fine-tune pre-trained speech models to handle the Darija dialect.
Benchmark ASR systems on colloquial Arabic speech.
Study linguistic variation and slang across Arabic-speaking regions, since the dataset mixes material from several dialects.
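For the benchmarking use case, word error rate (WER) is a common metric for ASR output. The following is a minimal sketch, not part of the dataset or its tooling: the function name `word_error_rate` is hypothetical, it splits on whitespace only, and it applies no text normalization. Darija can be written in Arabic or Latin script, so a real evaluation would likely need a normalization step chosen after inspecting the transcripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, with no normalization of script or diacritics.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.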
Strengths
Contains 13,178 transcribed audio samples, providing a substantial base for model training.
Audio data totals 8.23 GB, indicating a significant volume of speech material.
Focuses on specific Darija dialects (Algerian, Moroccan) and includes slang, offering targeted linguistic variety.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
The dataset's specific collection methodology and potential biases are not detailed in the provided description.
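Since the columns are undocumented, a first inspection pass after download might tally the value types and missing entries per field. The sketch below assumes the rows have already been loaded as dictionaries by whatever loader is used; `summarize_columns` and the sample column names `audio` and `text` are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Mapping


def summarize_columns(
    rows: Iterable[Mapping[str, Any]], max_examples: int = 2
) -> Dict[str, Dict[str, Any]]:
    """Report value types, missing counts, and a few examples per column."""
    type_counts: Dict[str, Counter] = defaultdict(Counter)
    examples: Dict[str, list] = defaultdict(list)
    total = 0
    for row in rows:
        total += 1
        for name, value in row.items():
            type_counts[name][type(value).__name__] += 1
            if value is not None and len(examples[name]) < max_examples:
                # Truncate long values (e.g. raw audio bytes) for display.
                examples[name].append(repr(value)[:60])

    summary: Dict[str, Dict[str, Any]] = {}
    for name, counts in type_counts.items():
        absent = total - sum(counts.values())  # rows lacking the key entirely
        summary[name] = {
            "types": dict(counts),
            "missing": absent + counts.get("NoneType", 0),
            "examples": examples[name],
        }
    return summary


# Hypothetical rows; the real field names must be read from the data.
sample_rows = [
    {"audio": b"\x00\x01", "text": "hello"},
    {"audio": b"\x02", "text": None},
    {"audio": b"\x03"},
]
column_summary = summarize_columns(sample_rows)
```

Running this over a few hundred rows is usually enough to tell which field holds the audio and which holds the transcript, and whether any transcripts are empty.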
Provenance
Source
ayoubkirouane on Hugging Face
Collection Method
The description says only that the data was "meticulously gathered from diverse resources"; it does not name the sources.
Freshness
Last updated 2024-07-18 15:04:07; check the source page for any later revisions.
Geography
Primarily Algeria and Morocco, with slang from other Arabic-speaking countries.
License
Unknown; users should verify the terms of use before downloading.